library(mosaic)
library(mosaicData)
library(tidyverse)
library(car)
library(ggplot2)
library(plotly)
library(datasets)
library(DT)
library(knitr)
library(pander)
library(alr4)
library(dplyr)
library(MASS)

Midterm Questions

Question 1

Q: Perform a regression using the airquality data set in R where the response variable is Ozone and the explanatory variable is Wind.

What is the value of the MSE of this regression rounded to the nearest whole number?


A: 701

  • Squaring the residual standard error of 26.47 gives 700.6609 for the MSE. Or, computing the MSE directly gives

    sum(air.lm$res^2)/114

In either case the result is MSE = 701

air.lm <- lm(Ozone ~ Wind, data=airquality)
summary(air.lm)
## 
## Call:
## lm(formula = Ozone ~ Wind, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.572 -18.854  -4.868  15.234  90.000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  96.8729     7.2387   13.38  < 2e-16 ***
## Wind         -5.5509     0.6904   -8.04 9.27e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.47 on 114 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.3619, Adjusted R-squared:  0.3563 
## F-statistic: 64.64 on 1 and 114 DF,  p-value: 9.272e-13
sum(air.lm$res^2)/114
## [1] 700.5177



Question 2

Q: Use the RailTrail data set in R from library(mosaic) to fit a regression to the following scatterplot. Compute the value of the residual for the solid dot labeled in the plot.


A: 0.17157

  • Run these codes:

    View(RailTrail)

    • locate row 25
    • notice that hightemp = 61 and lowtemp = 35 for observation #25

    rail.lm <- lm(hightemp ~ lowtemp, data=RailTrail)
    rail.lm$residuals[25]

    • gives the residual for the 25th dot directly or…

    predict(rail.lm, data.frame(lowtemp=c(35)))

    • 60.82843
    • That gives the Y-hat (or predicted) high temperature in degrees Fahrenheit for a day with a low temperature of 35 degrees Fahrenheit.

To find the residual, use:

\[r_i = Y_i - \hat{Y}_i = 61 - 60.82843 = 0.17157\]

rt.lm <- lm(hightemp ~ lowtemp, data = RailTrail)
plot(hightemp ~ lowtemp, data = RailTrail)
abline(rt.lm)

predict(rt.lm, data.frame(lowtemp = 35))
##        1 
## 60.82843
61 - 60.82843
## [1] 0.17157



Question 3

Q: Each of the following equations comes from student submissions of the Predicting the Weather Analysis. Most of them have mistakes in them.

Which equation does not have any mistakes in it?


A: \(E\{\underbrace{Y_i}_\text{High Temp}\} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Complicated X}\)

In our course, the three symbols of \(Y_i, \ \hat{Y}_i, \ \text{and} \ E\{Y_i\}\) all represent different things. So when an equation uses the \(Y_i\) it should be of the form:

\(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0,\sigma^2)\)

When an equation uses the predicted value \(\hat{Y}_i\), it should be of the form:

\(\hat{Y}_i = b_0 + b_1 X_{i}\)

and usually we replace the \(b_0\) and \(b_1\) values with their numeric values from the “Estimate” column of the regression summary output in RStudio.

Finally, when an equation uses the \(E\{Y_i\}\) it is denoting the true regression relation and it should be of the form: \(E\{Y_i\} = \beta_0 + \beta_1 X_{i}\)
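As an illustration of the \(\hat{Y}_i\) form with the numeric estimates filled in (sketched here with the built-in cars data set, since the weather data are not reproduced in this question):

```r
# Fit a regression and write out the estimated equation Y-hat = b0 + b1*X.
# The cars data set stands in for the weather data here.
cars.lm <- lm(dist ~ speed, data = cars)
b <- coef(cars.lm)
cat("Y-hat =", round(b[1], 4), "+", round(b[2], 4), "* X\n")
# prints: Y-hat = -17.5791 + 3.9324 * X
```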



Question 4

Q: What is the value of the MSE for the simple linear regression depicted in the plot? - You shouldn’t be able to tell by just looking at it. Do some calculations. The residual for each point has been labeled for you.


A: 118.4

As the residuals are all depicted in the graph, we can calculate the SSE using:

sum( c(-6.6,13.8,-9.8,4.6,-2)^2 )

which equals 355.2.

Since the line is a “simple linear regression” line, it must use two parameters, so p = 2. By visually counting, we see there are 5 dots in the plot, so n = 5.

MSE = SSE/(n-p) = 355.2 / (5 - 2) = 118.4

res <- c(-6.6, 13.8, -9.8, 4.6, -2)

SSE <- sum(res^2)

n <- length(res)

MSE <- SSE / (n - 2)  # parentheses matter: SSE / n - 2 would give 69.04, which is wrong
print(MSE)
## [1] 118.4



Question 5

Q: For the Residuals vs Fitted plot shown, what is the value of \(Y_i\) for the solid orange dot?


A: 45

Reading the x-axis of the plot, the solid orange dot has a fitted value of 60. Reading the y-axis, its residual is -15. Using the equation of a residual

\[ r_i = Y_i - \hat{Y}_i \\ Y_i = r_i + \hat{Y}_i \ \text{(solve for Y)} \]

shows that the y-value can be found by adding the value of the residual to the fitted value.

That gives a guess of 60 - 15 = 45 for the value of \(Y_i\) for the solid orange dot.
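The identity \(Y_i = r_i + \hat{Y}_i\) holds exactly for every point of any fitted model; a quick check in R (using the cars data set, since this question's data are not provided):

```r
# residual + fitted value recovers the observed Y for every observation
cars.lm <- lm(dist ~ speed, data = cars)
recovered <- resid(cars.lm) + fitted(cars.lm)
all.equal(as.numeric(recovered), cars$dist)  # TRUE
```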



Question 6

Q: Suppose a researcher developed the theory that each one-degree increase in daily average temperature reduces average daily wind speed by 1/6 of a mile per hour.

Use the airquality data to test the following hypothesis.

\[\underbrace{Y_i}_\text{Wind} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Temp} + \epsilon_i \text{ where } \epsilon_i \sim N(0,\sigma^2)\]

\[H_0: \beta_1 = -0.167 \\ H_a: \beta_1 \neq -0.167\]

Report the value of the p-value of the test.


A: p-value = 0.898

wind.lm <- lm(Wind ~ Temp, data = airquality)
summary(wind.lm)
## 
## Call:
## lm(formula = Wind ~ Temp, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5784 -2.4489 -0.2261  1.9853  9.7398 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.23369    2.11239  10.999  < 2e-16 ***
## Temp        -0.17046    0.02693  -6.331 2.64e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.142 on 151 degrees of freedom
## Multiple R-squared:  0.2098, Adjusted R-squared:  0.2045 
## F-statistic: 40.08 on 1 and 151 DF,  p-value: 2.642e-09
twind = (-0.17046 - -0.167) / 0.02693

2*pt(-abs(twind), 151)
## [1] 0.8979391

Question 7

Q: Continue using the regression model from the previous question on the airquality data set:

\[\underbrace{Y_i}_\text{Wind} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Temp} + \epsilon_i \text{ where } \epsilon_i \sim N(0,\sigma^2)\]

Suppose the impossible happened and May 1st of a certain year ended up being 0 degrees Fahrenheit in New York City.

What daily average wind speed (in miles per hour) does your regression model expect for such a day?


A: 23 mph

wind.lm <- lm(Wind ~ Temp, data=airquality)
coef(wind.lm)
## (Intercept)        Temp 
##  23.2336881  -0.1704644
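Equivalently, predicting at Temp = 0 returns the intercept, since \(\hat{Y} = b_0 + b_1 \cdot 0 = b_0\):

```r
wind.lm <- lm(Wind ~ Temp, data = airquality)
# At Temp = 0 the prediction reduces to the intercept, about 23.23 mph.
predict(wind.lm, data.frame(Temp = 0))
```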



Question 8

Q: A regression is performed, and the following output obtained. Use this output to create a 95% confidence interval for the true y-intercept (\(\beta_0\)) corresponding to this regression.


A: (1.2711, 2.6979)

1.9845 + qt(c(0.025, 0.975), 37)*0.3521
## [1] 1.271078 2.697922
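When the fitted model object is available (here only the summary output is given, so the manual qt() computation is needed), confint() performs the same estimate ± t·SE computation directly; a sketch with the cars data set:

```r
# confint() builds estimate +/- t * SE intervals from a fitted model.
cars.lm <- lm(dist ~ speed, data = cars)
confint(cars.lm, level = 0.95)
# The intercept row matches the by-hand version (df = n - p = 48 for cars):
s <- summary(cars.lm)$coefficients
s["(Intercept)", "Estimate"] + qt(c(0.025, 0.975), df = 48) * s["(Intercept)", "Std. Error"]
```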



Question 9

Q: Use the mtcars data set in R to perform a log(Y) transformation regression of mpg predicted by wt.

Use your transformation regression to predict the gas mileage (mpg) of a vehicle that has a weight of 2,000 lbs.

Be sure to take note of the ?mtcars help file and see the wt variable for details on how this variable is recorded.


A: 26.79846

lm1 <- lm(mpg ~ wt, data=mtcars)

boxCox(lm1)

#Suggests a 0 value is best, or log(Y) transformation.

lm2 <- lm(log(mpg) ~ wt, data=mtcars)

exp(predict(lm2, data.frame(wt=2))) #prediction transformed back
##        1 
## 26.79846



Question 10

Q: Run these codes in R:

library(mosaic) #contains the Weather data
plot(high_temp ~ high_dewpt, data=Weather)

Fit the following regression model on the data.

\[\underbrace{Y_i}_\text{High Temp} = \beta_0 + \beta_1 \underbrace{X_i}_\text{High Dew Point} + \epsilon_i \text{ where } \epsilon_i \sim N(0,\sigma^2)\]

Select the plot below that correctly depicts (for the above model) the 95% prediction interval for the high temperature when the high dew point is 80.


A:

# actual

plot(high_temp ~ high_dewpt, data=Weather, ylim=c(0,120))
temp_lm <- lm(high_temp ~ high_dewpt, data=Weather)
abline(temp_lm)
predict(temp_lm, data.frame(high_dewpt=80), interval="prediction")
##        fit      lwr     upr
## 1 91.52634 74.49373 108.559
lines(c(80,80), c(74.49373, 108.559), lwd=6, col=rgb(0.2,.8,.5,.5))
lines(c(-21,80,80,-21), c(74.49373, 74.49373, 108.559, 108.559), lty=2, col="gray")



Question 11

Q: Which of the following is a “best” statement to make when interpreting a regression slope?


A: The change in the average y-value per unit change in x.

  • Incorrect answers:
    • The average change in y as x varies.
    • The average y-value when x is zero.
    • The average change in the y-value per unit change in x.
  • Further Explanation: The regression line models the average y-value for any given x-value. Since the slope is the change in the line as x changes by 1 unit, the slope should be interpreted as the change in the average y-value (the line) for each unit change in x.



Question 12

Q: Suppose you are asked to help decide how wide (in centimeters) kids’ shoes should be for fourth grade girls with foot lengths of 22 centimeters using the KidsFeet dataset, which contains data about fourth grade girls and boys.

Perform a meaningful regression on the girls data provided in the KidsFeet data set. Select the interval below that, according to your regression, we are 95% confident contains 95% of the actual widths of fourth-grade girls’ feet that are 22 centimeters long.


A: We predict girls feet to be anywhere from 7.32 cm to 9.19 cm wide, when their foot length is 22 cm.

  • Incorrect answers:
    • We predict girls feet to be anywhere from 8.13 cm to 9.94 cm wide, when their foot length is 22 cm.
    • We predict girls feet to be anywhere from 8.07 cm to 9.81 cm wide, when their foot length is 22 cm.
    • We predict girls feet to be anywhere from 7.86 cm to 8.65 cm wide, when their foot length is 22 cm.
girlsFeet <- filter(KidsFeet, sex=="G")

plot(width ~ length, data=girlsFeet)

girls.lm <- lm(width ~ length, data=girlsFeet)
predict(girls.lm, data.frame(length=22), interval="prediction")
##        fit      lwr      upr
## 1 8.253979 7.320637 9.187321



Question 13

Q: The solid orange dot shown in the scatterplot below corresponds with which point on the residuals vs. fitted plot?


A: The point labeled “Chrysler Imperial”

plot(mpg ~ wt, data=mtcars, main="mtcars data set")

points(mpg ~ wt, data=mtcars[17,], pch=16, col="orange", cex=1.2)

mt.lm <- lm(mpg ~ wt, data=mtcars)

abline(mt.lm)

plot(mt.lm, which=1)

points(mt.lm$res[17] ~ mt.lm$fit[17], pch=16, cex=1.2, col="orange")



Question 14

Q: Consider the following scatterplot. Apply an appropriate Y transformation to the data shown in the plot.

State the R-squared value of the regression on the transformed data.

plot(Ozone ~ Temp, data=airquality)

Or, if you prefer, here is the plot in ggplot2:

ggplot(airquality, aes(x=Temp, y=Ozone)) + geom_point()


A: 0.5632

ggplot(airquality, aes(x=Temp, y=Ozone)) +
    geom_point()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

air3.lm <- lm(Ozone ~ Temp, data=airquality)

boxCox(air3.lm)

air3ss.lm <- lm(sqrt(sqrt(Ozone)) ~ Temp, data=airquality)
summary(air3ss.lm)
## 
## Call:
## lm(formula = sqrt(sqrt(Ozone)) ~ Temp, data = airquality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86085 -0.23007  0.00676  0.21087  1.07342 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.661530   0.254646  -2.598   0.0106 *  
## Temp         0.039362   0.003246  12.125   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3302 on 114 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.5632, Adjusted R-squared:  0.5594 
## F-statistic:   147 on 1 and 114 DF,  p-value: < 2.2e-16
air3log.lm <- lm(log(Ozone) ~ Temp, data=airquality)
summary(air3log.lm)
## 
## Call:
## lm(formula = log(Ozone) ~ Temp, data = airquality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.14469 -0.33095  0.02961  0.36507  1.49421 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.83797    0.45100  -4.075 8.53e-05 ***
## Temp         0.06750    0.00575  11.741  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5848 on 114 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.5473, Adjusted R-squared:  0.5434 
## F-statistic: 137.8 on 1 and 114 DF,  p-value: < 2.2e-16
# saunders code

lm1 <- lm(Ozone ~ Temp, data=airquality)

boxCox(lm1)

#the results of Box-Cox show a transformation should be applied, with lambda = 0.25

air.lm.t <- lm(sqrt(sqrt(Ozone)) ~ Temp, data=airquality)

summary(air.lm.t)
## 
## Call:
## lm(formula = sqrt(sqrt(Ozone)) ~ Temp, data = airquality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86085 -0.23007  0.00676  0.21087  1.07342 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.661530   0.254646  -2.598   0.0106 *  
## Temp         0.039362   0.003246  12.125   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3302 on 114 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.5632, Adjusted R-squared:  0.5594 
## F-statistic:   147 on 1 and 114 DF,  p-value: < 2.2e-16
#Gives an R-squared of 0.5632



Question 15

Q: Open the ChickWeight data set in R.

Perform a linear regression that predicts the weight of chickens according to how old the chick is in days.

Which of the following correctly diagnoses the appropriateness of this regression?


A: Constant Variance and normal errors are both violated, but the linear relation isn’t too bad, relatively speaking.

  • Incorrect answers:
    • Linear Relation is the only assumption being violated. The data looks to have constant variance and normal errors.
    • Everything looks good except for the non-normal errors.
    • The constant variance is violated, but linearity and normality look fine.
  • Further explanation: Looking at the residuals vs. fitted plot, it should be clear that the variance of the residuals (in the vertical direction) increases from left to right. Also, the Q-Q plot makes it obvious that normality fails dramatically. Though there is a slight bend in the red smoothing line on the left of the graph, the red line is flat for most of the range, indicating that linearity has some problems, but nothing close to the problems with the variance and normality.
chick.lm <- lm(weight ~ Time, data = ChickWeight)


summary(chick.lm)
## 
## Call:
## lm(formula = weight ~ Time, data = ChickWeight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -138.331  -14.536    0.926   13.533  160.669 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.4674     3.0365   9.046   <2e-16 ***
## Time          8.8030     0.2397  36.725   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.91 on 576 degrees of freedom
## Multiple R-squared:  0.7007, Adjusted R-squared:  0.7002 
## F-statistic:  1349 on 1 and 576 DF,  p-value: < 2.2e-16
par(mfrow=c(1,3))

plot(chick.lm, which=1)

qqPlot(chick.lm$residuals, main="Q-Q Plot", col="steelblue3", col.lines="steelblue4", pch= 19, id=FALSE)

plot(chick.lm$residuals, main = "Residuals vs. Order")



Question 16

Q: Consider the linear model

\[Y_i = \beta_0 + \beta_1X_i+ \epsilon_i \text{ where } \epsilon_i \sim N(0,\sigma^2)\]

Which of the following is a true statement about this model?


A: \(\beta_0 + \beta_1 X_i\) describes the functional relationship of the data to the true model.

  • Incorrect answers:
    • A small value for results in a very low value of \(R^2\)
    • The model is purely statistical and does not contain any functional relations.
    • \(\epsilon_i\) describes the average value of \(Y_i\) for given values of \(X_i\)



Question 17

Q: What is the value of SSE according to the output shown below?


A: Value of SSE: 25.2

Using the residual standard error: 1.875^2 * 22. Squaring the residual standard error gives the MSE. Multiplying the MSE by the (n - p) = 22 degrees of freedom gives back the SSE, because MSE = SSE/(n-p) and sqrt(MSE) = residual standard error.
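This relationship can be verified on any fitted model; a sketch with the cars data set:

```r
# RSE^2 * df reproduces SSE, since RSE = sqrt(SSE / (n - p)).
cars.lm <- lm(dist ~ speed, data = cars)
RSE <- summary(cars.lm)$sigma   # residual standard error
df  <- cars.lm$df.residual      # n - p = 48 for the cars data
SSE <- sum(resid(cars.lm)^2)
c(RSE^2 * df, SSE)              # the two values agree
```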



Question 18

Q: A researcher develops a regression model to predict the highway miles per gallon of a vehicle based on the city miles per gallon of the vehicle. They run the following codes to do this.

?mpg

View(mpg)

plot(hwy ~ cty, data = mpg)

mpg.lm <- lm(hwy ~ cty, data=mpg)

  • The researcher is wondering how to best interpret the slope from their model.

  • Select the statement below that best interprets the slope of their model.

mpg.lm <- lm(hwy ~ cty, data=mpg)
summary(mpg.lm)
## 
## Call:
## lm(formula = hwy ~ cty, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3408 -1.2790  0.0214  1.0338  4.0461 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.89204    0.46895   1.902   0.0584 .  
## cty          1.33746    0.02697  49.585   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.752 on 232 degrees of freedom
## Multiple R-squared:  0.9138, Adjusted R-squared:  0.9134 
## F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16
plot(hwy ~ cty, data = mpg)
abline(mpg.lm)


A: The average highway gas mileage of vehicles increases by 1.337 miles per gallon for every 1 mpg increase in city gas mileage of the vehicle.

  • Incorrect interpretations:
    • On average, highway gas mileage of vehicles is 1.337 miles per gallon greater than the city gas mileage of the vehicle.
    • Every one mpg increase in city gas mileage of a vehicle results in 1.337 miles per gallon increase in the highway gas mileage of the vehicle.
    • The change in average highway gas mileage of vehicles increases by 1 mile per gallon for every 1.337 mpg increase in city gas mileage of the vehicle.
  • Further Explanation:
    • The slope is the change in the average Y-value for a 1 unit change in the x-value.
    • Every one mpg increase in city gas mileage of a vehicle results in 1.337 miles per gallon increase in the highway gas mileage of the vehicle. (Missing the word “average” when talking about Y.)
    • On average, highway gas mileage of vehicles is 1.337 miles per gallon greater than the city gas mileage of the vehicle. (The word “average” is incorrectly placed. Also, the slope is multiplied to the X-value, not added to it, so “1.337 miles per gallon greater than the city gas mileage” is an incorrect statement. It is 1.337 times as great in the gas mileage as the city gas mileage, plus the intercept of 0.892 miles per gallon.)
    • The change in average highway gas mileage of vehicles increases by 1 mile per gallon for every 1.337 mpg increase in city gas mileage of the vehicle. (The numbers for “1 unit” and the slope of “1.337” are misplaced.)



Question 19

Q: Select the plot below that correctly depicts the least squares regression line for the data shown. - You should be able to visually tell which one is the least squares line, just by looking at the plots.


A: The least squares line should go through the average of the Y-values as you move across the x-values. Only one plot does this visually.

  • The correct answer!

  • NOT the correct answers!!



Question 20

Q: This question goes a little beyond your current knowledge, but uses ideas you should be familiar with in the same way you have been using them previously.

What is the p-value for the missing entry in the output below? Note: the sample size for this regression was n=30.


A: Use pt(-abs(estimate / std. error), n - p)*2, where p is the number of parameters being estimated by the model, which is 4 here. (Recognizing that p = 4 is the hardest part.)

tm <- -109.736 / 374.404
print(tm)
## [1] -0.2930952

# degrees of freedom = n - p = 30 - 4 = 26
pt(-abs(tm), 26)*2
## [1] 0.771776



Question 21

Q: Use the Orange dataset in R to perform a linear regression of the circumference (y) of an orange tree according to the age of the tree (x).

Which regression assumption is not satisfied for this regression?


A: The constant variance assumption. The Q-Q plot looks great. Linearity is okay: even though the red line is a little jagged, the points don’t seem to pull away from the line too strongly. Clearly the x-values are all fixed, as they are “replicated” in the study. The residuals vs. fitted plot, however, shows a strong increasing trend in the vertical variability of the residuals as we move from left to right, so constant variance is violated.

plot(circumference ~ age, data=Orange)

lm1 <- lm(circumference ~ age, data=Orange)

plot(lm1, which=1)

qqPlot(lm1$res)

## [1] 21 27


Question 22

Q: Which of the following best describes SSR?

A: SSR: A measure of how much the regression line differs from the average of all of the y-values.
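In symbols, \(SSR = \sum(\hat{Y}_i - \bar{Y})^2\), and together with SSE it accounts for the total variation; a sketch with the cars data set:

```r
cars.lm <- lm(dist ~ speed, data = cars)
Ybar <- mean(cars$dist)
SSR  <- sum((fitted(cars.lm) - Ybar)^2)  # line vs. the mean of the y-values
SSE  <- sum(resid(cars.lm)^2)            # points vs. the line
SSTO <- sum((cars$dist - Ybar)^2)        # points vs. the mean
all.equal(SSR + SSE, SSTO)               # TRUE
```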



Question 23

Q: Consider transforming Y in the regression of stopping distance on speed using the cars dataset in R. Which of the following shows the “best” Y-transformation possible for this data in the original units?

carssd.lm <- lm(dist ~ speed, data = cars)

boxCox(carssd.lm)

lm.log <- lm(log(dist) ~ speed, data = cars)
b.log <- coef(lm.log)

lm.sqrt <-lm(sqrt(dist) ~ speed, data = cars)
b.sqrt <- coef(lm.sqrt)

lm.1oy <- lm(1/dist ~ speed, data = cars)
b.1oy <- coef(lm.1oy)

lm.y <- lm(dist ~ speed, data = cars)
b.y <- coef(lm.y)

lm.2 <- lm(dist^2 ~ speed, data = cars)
b.2 <- coef(lm.2)

lm.ss <- lm(sqrt(sqrt(dist)) ~ speed, data = cars)
b.ss <- coef(lm.ss)

plot(dist ~ speed, data = cars, pch=16, col="orangered", main="cars", xlab="Speed", ylab="dist")
legend("topleft", legend=c("1","2","3","4","5","6"), lty=1, col=c(1,2,3,4,5,6))

curve( (b.sqrt[1] + b.sqrt[2]*x)^2 , add=TRUE, col=2)



Question 24

Q: Consider the scatterplot shown below. Create this scatterplot in R and add the regression line to the plot.

At which age does the actual average of the dots seem to be furthest from the line?

lob.lm <- lm(height ~ age, data = Loblolly)
plot(height ~ age, data = Loblolly)
abline(lob.lm)

par(mfrow=c(1,3))

plot(lob.lm, which=1)

qqPlot(lob.lm$residuals, main="Q-Q Plot", col="steelblue3", col.lines="steelblue4", pch= 19, id=FALSE)

plot(lob.lm$residuals, main = "Residuals vs. Order")


A: Age 3 - The red line is furthest from the gray line for the first and last groups, ages 3 and 20. Since only age 3 is an option in the list above, it is the best answer.

lob.lm <- lm(height ~ age, data=Loblolly)


plot(lob.lm, which=1)



Question 25

Q: Consider the several scatterplots shown below of Y against three different x-variables, and each x-variable against each other. In each case, the vertical axis of the graph is the variable name to the side of the plot, while the horizontal axis is given by the variable name above or below the plot.

Which plot would have the highest R-squared value when using a simple linear regression (no transformations) in each case?


A: The plot of Y ~ X1



Final Exam Practice

Question 1

Q: Which of the following statements contains an error?


A: \(\underbrace{\hat{Y_i}}_\text{Weight} = \beta_0 + \beta_1 \underbrace{X_i}_\text{height}\)

The error: the fitted-value form \(\hat{Y}_i\) must use the estimates \(b_0\) and \(b_1\), not the parameters \(\beta_0\) and \(\beta_1\). The other three statements are written correctly:

\[E\{\underbrace{Y_i}_\text{weight}\} = \beta_0 + \beta_1 \underbrace{X_i}_\text{height}\]

\[\underbrace{Y_i}_\text{weight} = \beta_0 + \beta_1 \underbrace{X_i}_\text{height} + \epsilon_i \text{ where } \epsilon_i \sim N(0, \sigma^2)\]

\[\underbrace{Y_i}_\text{weight} = \beta_0 + \beta_1 \underbrace{X_i}_\text{height} + \beta_2 \underbrace{X_{2i}}_\text{age} + \epsilon_i \text{ where } \epsilon_i \sim N(0, \sigma^2)\]



Question 2

Q: Suppose a researcher from the 1920’s was studying how far it took vehicles to stop (in feet) based on the speed they were traveling (in mph) when the brakes were applied.

Suppose further that they came up with the following regression model from their study.

\[\hat{Y_i} = -17 + 4.2X_i\]

Use the “cars” data set in R to validate this regression model. What is the value of the validation R-squared for this model? (Hint: Yhat <- -17…)


A: 0.6141207

Yhat <- -17 + 4.2*cars$speed
Ybar <- mean(cars$dist)
SSE <- sum( (Yhat - cars$dist)^2 )
SSTO <- sum( (cars$dist - Ybar)^2 )
1 - SSE/SSTO
## [1] 0.6141207



Question 3

Q: Four degree 1 lowess curves are shown below, each time on the same cars data.

Which lowess curve uses 20% of the neighboring dots at each local regression to create the curve?

You’ll need to draw it yourself to really know.


A:

plot(dist ~ speed, data=cars, main="car data set")

lines(lowess(cars$speed, cars$dist, f=.2))



Question 4

Q: A logistic regression is performed to model the probability that a student will get an A in Math 325 based on their Analysis score. The following output is obtained.

Which of the following is a correct interpretation of this output?


A: The odds of a student getting an A in Math 325 triple for every extra point a student gains in their Analysis Total score.

exp(1.1423)
## [1] 3.133968
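In general, exp() of a logistic regression slope gives the factor by which the odds of Y = 1 change per one-unit increase in X; a sketch with the mtcars data set (the exam's Math 325 data are not available):

```r
# exp(b1) = multiplicative change in the odds of Y = 1 per unit of X.
m.glm <- glm(am ~ wt, data = mtcars, family = binomial)
exp(coef(m.glm)["wt"])  # here the odds of a manual transmission shrink as weight grows
```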



Question 5

Q: A regression is performed, and the following output obtained. What conclusion can we make from this output?


A: This regression model significantly predicts the mean Y-value, but does not do the best job at predicting the individual Y values.



Question 6

Q: Suppose a student fits the model Y ~ X2 after looking at the following pairs plot.

They then compute the residuals (here called R) from their regression and make this second pairs plot.

What should they do next?


A: They should add a quadratic term for X2 to their model.



Question 7

Q: Use the mtcars data set in R to plot the robust regression line for hp ~ wt.

Identify the robust regression line in the plot below.


A: blue line

library(MASS)

plot(hp ~ wt, data=mtcars)

abline(rlm(hp ~ wt, data=mtcars), col="dodgerblue", lwd=2) #correct answer

abline(lm(hp ~ wt, data=mtcars), col="green", lwd=2)

abline(-.5, 35, col="orange", lwd=2)

abline(-2, 50, col="firebrick", lwd=2)

mtext(side=3, text="mtcars data set")



Question 8

Q: Which logistic curve represents the males in the following plot?


A: The male line is the orange line, as men were less likely to survive the Titanic than women.



Question 9

Q: Based on the information shown below, what is the predicted cost of a Toyota Corolla with 40,000 miles on it?


A: $14,510

To get the answer you must use the coefficients in the summary output below the graph.

Then the equation becomes:

\[ \hat{Y} = \text{(intercept)} + \text{mileage}(40000) + \text{corolla} + \text{corolla:mileage}(40000) \\ = 19408 - 0.1926(40000) - 2002 + 0.1202(40000) = 14510 \]
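With the fitted model object in hand, predict() performs this same arithmetic automatically; a sketch of an interaction model of the same form using the mtcars data set (the exam's Corolla data are not available):

```r
# Separate intercepts and slopes for two groups (am = 0 or 1), as in the
# mileage/make model: prediction = intercept + slope*x + offset + interaction*x.
mt.lm <- lm(mpg ~ wt * factor(am), data = mtcars)
b <- coef(mt.lm)
manual <- b[1] + b[2]*3 + b[3] + b[4]*3     # by-hand prediction for am = 1, wt = 3
auto   <- predict(mt.lm, data.frame(wt = 3, am = 1))
c(manual, auto)                             # the two values agree
```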



Question 10

Q: Which of the following is a “best” statement to make when interpreting a regression slope?


A: The change in the average y-value per unit change in x.



Question 11

Which of the following graphs shows a situation where it would be possible for the following numbers to occur?

SSE = 1.698519

SSR = 2730.497

SSTO = 2738.636


A:



Question 12

Use the cars data set in R to test the following hypotheses.

\[\underbrace{Y_i}_\text{dist} = \beta_0 + \beta_1 \underbrace{X_i}_\text{speed} + \epsilon_i \text{ where } \epsilon_i \sim N(0, \sigma^2)\]

\[H_0: \beta_1 = 4.2 \\ H_a: \beta_1 \neq 4.2\]

Report the p-value of your test.


A: 0.52

 summary(lm(dist ~ speed, data=cars))
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

\[t = (3.9324 - 4.2)/0.4155 = -0.6440433 \\ \text{pt}(-0.6440433, 48) \cdot 2 = 0.5226131\]



Question 13

Q: Estimate the coefficients in the following regression model using the airquality data set in R.

\[\underbrace{Y_i}_\text{Temp} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Month} + \beta_2 X_{i}^2 + \epsilon_i \text{ where } \epsilon_i \sim N(0, \sigma^2)\]


A: \(b_0 = -95.73, b_1 = 48.72, b_2 = -3.28\)

coef(lm(Temp ~ Month + I(Month^2), data=airquality))
## (Intercept)       Month  I(Month^2) 
##  -95.725568   48.718276   -3.282813



Question 14

Q: Which of the following graphs would support the statement that “the odds that Y = 1 increases by 20% for each 1 unit change in x”?


A:



Question 15

Q: Suppose you are asked to help decide how wide (in centimeters) kid’s shoes should be for kids with foot lengths of 25 centimeters. Suppose further you are provided with the KidsFeet dataset found in library(mosaic). Perform a regression that would help provide an answer to this question.

Select the interval below that, according to your regression, should contain 95% of the widths of kids feet that are 25 centimeters long.


A: We predict kids feet to be anywhere from 8.25 cm to 9.87 cm wide, when their foot length is 25 cm.

plot(width ~ length, data=KidsFeet)

predict(lm(width ~ length, data=KidsFeet), data.frame(length=25), interval="prediction")
##       fit      lwr      upr
## 1 9.06097 8.247227 9.874713



Question 16

Q: What is the value of the residual for the orange dot in the plot below?


A: 42.52537

The easiest way to locate this value is with the residual plot, which identifies the three largest residuals automatically.

cars.lm <- lm(dist ~ speed, data=cars)

plot(cars.lm, which=1)

From that plot, you should notice that point 23 is in the same position as the orange dot in the plot of this question. Thus,

cars.lm$residuals[23]
##       23 
## 42.52537



Question 17

Q: Which regression from the options below would best fit the data for the scatterplot shown?


A: A lowess curve.

X <- runif(30,0,50)

Y <- 20000 - 800*X + 25*X^2 + -.03*X^3 + .005*X^4 - .0002*X^5 + rnorm(30,0,1000)

plot(Y ~ X)



Question 18

Q: Return to the RailTrail data set in R.

Perform an appropriate analysis that would allow you to predict the average volume of users on the trail by knowing the cloud cover measurement for a given day.

Draw a graphic that includes the results of your analysis in the plot.

Use your graphic to estimate the average volume for days with a cloud cover measurement of 8.


A: Around 343 users on average.

Run these codes…

plot(volume ~ cloudcover, data=RailTrail)
abline(lm(volume ~ cloudcover, data=RailTrail))

# Then look at the plot above an x-value of 8.

abline(v=8, lty=2) #if this helps your eyes, then use it.

# That corresponds to a y-value of about 343

abline(h=343, lty=2)

To find the exact value, do some math:

summary(lm(volume ~ cloudcover, data=RailTrail))
## 
## Call:
## lm(formula = volume ~ cloudcover, data = RailTrail)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -217.61  -80.66   -3.98   67.71  348.66 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  461.322     25.904   17.81  < 2e-16 ***
## cloudcover   -14.797      3.905   -3.79 0.000276 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 118.9 on 88 degrees of freedom
## Multiple R-squared:  0.1403, Adjusted R-squared:  0.1305 
## F-statistic: 14.36 on 1 and 88 DF,  p-value: 0.0002757
461.322 - 14.797*8 
## [1] 342.946
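Alternatively, predict() returns the same estimate without copying rounded coefficients by hand (rail2.lm is just a variable name chosen here):

rail2.lm <- lm(volume ~ cloudcover, data=RailTrail)
predict(rail2.lm, data.frame(cloudcover=8)) # about 343, matching the graphical estimate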



Question 19

Q: The following plots show the same five points in each plot but a different line in each case. Which plot shows a regression line with the smallest SSE, i.e., the least squares regression line? - You shouldn’t be able to tell by just looking at it. Do some calculations.


A: Computing the SSE for each plot identifies the plot whose residuals give

sum(c(-7.1,13,-7.4,4.1,-2.7)^2) # = 298.27

as the smallest SSE, making that line the least squares regression line.



Question 20

Q: View the Marriage data in R.

This data set lists characteristics of 98 individuals (49 couples) at the time they were married. The age column lists the age of each individual. The prevcount column states how many times the individual was previously married prior to this marriage. So a prevcount of 2 means that person was previously married twice.

Perform an analysis using this data that looks to see if the age of the individual at the time of marriage can predict if that person was previously married.

Then, suppose that Frank is getting married and is 27 years old. Based on your analysis, what is the probability that Frank was previously married?


A: There is about a 32.3% chance that Frank was previously married.

myglm <- glm(prevcount > 0 ~ age, data=Marriage, family=binomial)

predict(myglm, data.frame(age=27), type="response")
##         1 
## 0.3232086
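Equivalently, the probability can be computed from the logistic formula using the fitted coefficients:

b <- coef(myglm)
1 / (1 + exp(-(b[1] + b[2]*27))) # same probability as predict(..., type="response")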



Question 21

Q: Run the following code in R.

plot(time ~ age, data=TenMileRace, col=sex)

Perform an appropriate analysis that would allow you to add two linear regression lines with the same slope to this plot. One line should be for the male’s data, the other for the female’s data. The common slope of the two lines should be 16.493.

Based on the results of your analysis, by how much do the average times of males and females differ?


A: By about 775.908 seconds, with males faster on average (see the sexM coefficient of -775.908).

mylm <- lm(time ~ age + sex, data=TenMileRace)

summary(mylm)
## 
## Call:
## lm(formula = time ~ age + sex, data = TenMileRace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3032.4  -637.5   -28.8   596.5  4663.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5592.610     37.604  148.72   <2e-16 ***
## age           16.493      1.013   16.28   <2e-16 ***
## sexM        -775.908     21.478  -36.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 965.3 on 8633 degrees of freedom
## Multiple R-squared:  0.136,  Adjusted R-squared:  0.1358 
## F-statistic: 679.5 on 2 and 8633 DF,  p-value: < 2.2e-16
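To actually add the two parallel lines to the plot, use the fitted coefficients with abline(); with col=sex, females plot in palette color 1 (black) and males in color 2 (red):

b <- coef(mylm)
plot(time ~ age, data=TenMileRace, col=sex)
abline(b[1], b[2], col="black")      # female line: intercept 5592.610, slope 16.493
abline(b[1] + b[3], b[2], col="red") # male line: same slope, shifted down by 775.908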



Question 22

Q: Use the airquality data set in R to perform a multiple regression of the following model.

\[\underbrace{Y_i}_\text{Ozone} = \beta_0 + \beta_1 \underbrace{X_{1i}}_\text{Wind} + \beta_2 \underbrace{X_{2i}}_\text{Temp} + \beta_3 \underbrace{X_{1i}}_\text{Wind} \underbrace{X_{2i}}_\text{Temp} + \epsilon_i\]

Based on your results in R, which of the regression lines shown on the plot below represents the line for a Temp of 70 degrees?


A: A

mylm <- lm(Ozone ~ Wind + Temp + Wind:Temp, data=airquality)

b <- coef(mylm)

plot(Ozone ~ Wind, data=airquality, cex=(Temp/max(Temp))^3+.2, xlim=c(0,22))
curve(b[1] + b[2]*x + b[3]*90 + b[4]*x*90, add=TRUE)
curve(b[1] + b[2]*x + b[3]*80 + b[4]*x*80, add=TRUE)
curve(b[1] + b[2]*x + b[3]*70 + b[4]*x*70, add=TRUE, col="blue")
curve(b[1] + b[2]*x + b[3]*100 + b[4]*x*100, add=TRUE)
text(1,29,"A")
text(1,65,"B")
text(1,105,"C")
text(1,140,"D")
points(c(18,18),c(140,160), cex=c((56/97)^3+.2,1.2))
text(x=c(18,18),y=c(140,160), pos=4, c("56 deg","97 deg"))



Question 23

Q: Select the residuals vs fitted values plot below that shows the greatest evidence of a non-linear pattern in the data.


A: The correct plot shows a red line with a strong, systematic bend that is well supported by the data, a sure witness of a non-linear pattern. (We are looking for a red line that curves systematically and is supported by the data.)

Note that another option shows an even more dramatic red line, but that plot is the least supported by the data because of its small sample size and the erratic behavior of the line.



Question 24

Q: Based on the pairs plot shown below, which model would be the “best” option for this data?


A: Y ~ X1 + X2 + I(X2^2)

So the best choice is the option that includes X1 along with both X2 and its square, X2^2.



Question 25

Q: Use the mtcars data set in R to perform an appropriate transformation regression of mpg against hp.

Use your transformation regression to predict the gas mileage (mpg) of a vehicle that has 350 horse power (hp).


A: 9.587017

lm1 <- lm(mpg ~ hp, data=mtcars)

boxCox(lm1)

#Suggests a 0 value is best, or log(Y) transformation.

lm2 <- lm(log(mpg) ~ hp, data=mtcars)

exp(predict(lm2, data.frame(hp=350))) #prediction transformed back
##        1 
## 9.587017



Math 425 Topics

Week 1: Simple Linear Regression

The Normal Error Regression Model

We assume the model:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \text{ where } \epsilon_i \sim N(0, \sigma^2) \]

This means the relationship between \(X\) and \(Y\) is linear, with some random noise \(\epsilon_i\) that follows a normal distribution.

Understanding Notation

  • \(Y_i\): Actual observed response for the i-th observation
  • \(\hat{Y}_i\): Predicted response from the regression model
  • \(E\{Y_i\}\): Expected value (mean) of \(Y_i\), equal to \(\beta_0 + \beta_1 X_i\); the fitted value \(\hat{Y}_i\) estimates it

Simulating Data and Fitting a Linear Model

set.seed(123)
n <- 50
x <- runif(n, 0, 10)
beta0 <- 2
beta1 <- 3
sigma <- 2
epsilon <- rnorm(n, mean = 0, sd = sigma)
y <- beta0 + beta1 * x + epsilon

model <- lm(y ~ x)

plot(x, y, main = "Simple Linear Regression", pch = 19)
abline(model, col = "blue", lwd = 2)
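Because the data were simulated, the fitted coefficients can be checked against the true values used above (beta0 = 2, beta1 = 3):

coef(model) # estimates should land near the true intercept of 2 and slope of 3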

Skill Quiz - Simple Linear Regression

Problem 1

Open the airquality dataset in R. Perform a regression of the daily average Wind speed (\(Y_i\)) using the daily maximum Temperature (\(X_i\)) as the explanatory variable.


Part (a)

Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.

\[ \underbrace{Y_i}_\text{Wind} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Temp} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0,\sigma^2) \]


Part (b)

Fit and summarize a simple linear regression model for this data.

# Type your code here

templm <- lm(Wind ~ Temp, data=airquality)

summary(templm)
## 
## Call:
## lm(formula = Wind ~ Temp, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5784 -2.4489 -0.2261  1.9853  9.7398 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.23369    2.11239  10.999  < 2e-16 ***
## Temp        -0.17046    0.02693  -6.331 2.64e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.142 on 151 degrees of freedom
## Multiple R-squared:  0.2098, Adjusted R-squared:  0.2045 
## F-statistic: 40.08 on 1 and 151 DF,  p-value: 2.642e-09


Part (c)

Type out the estimated equation for this regression model using your estimates found in part (b).

\[ \hat{Y}_i = 23.234 + (-0.170)X_i \]


Part (d)

Plot the data and the estimated regression model.

# Type your code here

ggplot(airquality, aes(x=Temp, y=Wind)) +
  geom_point(size=1.5, color="lightblue", alpha= 0.5) +
  geom_smooth(method="lm", formula=y~x, se=FALSE, size=0.5,color="skyblue")+
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.


Part (e)

Use your model to predict the average daily average Wind speed for an outside temperature of 72 degrees F.

\[ \hat{Y}_i = 23.234 + (-0.170)(72) \]

# Type your code here

23.234 + (-0.170)*(72)
## [1] 10.994
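The same prediction can be made with predict(), which avoids rounding the coefficients (this assumes templm from Part (b) is still in the workspace):

predict(templm, data.frame(Temp=72))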


Part (f)

Write out an interpretation of the slope and intercept of your model. Are both meaningful for this data?

Slope Interpretation: An increase of 1 degree F in the daily maximum Temperature (\(X_i\)) is associated with a 0.17 mph decrease in the average daily average Wind speed (\(Y_i\)).

  • Is the slope meaningful?:
    Yes, the regression seems to be a good fit to this data and the slope is significant so the interpretation of the slope is meaningful for this data.


Intercept Interpretation: When the daily maximum Temperature is 0 degrees F, the average daily average Wind speed is estimated to be 23.234 mph.

  • Is the intercept meaningful?: Only partially. The intercept is significant and 0 degrees F is a physically possible temperature, but it lies far below the temperatures observed in this data (56 to 97 degrees F), so interpreting the intercept requires extrapolation.



Problem 2

Open the mtcars dataset in R. Fit a regression model to the data that can be used to explain average gas mileage of a vehicle (mpg) (\(Y_i\)) using the weight (wt) (\(X_i\)) of the vehicle.


Part (a)

Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.

\[ \underbrace{Y_i}_\text{mpg} = \beta_0 + \beta_1 \underbrace{X_i}_\text{wt} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0,\sigma^2) \]


Part (b)

Fit and summarize a simple linear regression model for this data.

carslm <- lm(mpg ~ wt, data=mtcars)

summary(carslm)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10


Part (c)

Type out the estimated equation for this regression model using your estimates found in part (b).

\[ \hat{Y}_i = 37.29 + (-5.34)X_i \]


Part (d)

Plot the data and the estimated regression model.

# Type your code here

ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point(size=1.5, color="lightblue", alpha= 0.5) +
  geom_smooth(method="lm", formula=y~x, se=FALSE, size=0.5,color="skyblue")+
  theme_minimal()


Part (e)

Use your model to predict the average gas mileage (mpg) for a vehicle that weighs 3,000 lbs. (Hint: ?mtcars)

Note: ?mtcars shows that wt is measured in 1,000s of lbs, so a 3,000 lb vehicle corresponds to wt = 3.

\[ \hat{Y}_i = 37.29 + (-5.34)(3) \]

# Type your code here
37.29 + (-5.34)*(3)
## [1] 21.27
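Using predict() with carslm from Part (b) gives the same answer without hand-rounding the coefficients:

predict(carslm, data.frame(wt=3))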


Part (f)

Write out an interpretation of the slope and intercept of your model. Are both meaningful for this data?

Slope Interpretation: An increase of 1,000 lbs in the weight (\(X_i\)) of a vehicle results in a 5.34 mpg decrease in the average gas mileage of such vehicles(\(Y_i\)).

  • Is the slope meaningful? : While the slope is significant, it only looks to be meaningful for vehicles that weigh between 2.5 thousand and 4 thousand pounds. Otherwise this regression does not seem to be a good fit for this data based on the scatterplot.


Intercept Interpretation: The average gas mileage of vehicles that weigh nothing (0 lbs) is estimated to be 37.29 mpg.

  • Is the intercept meaningful?: No, the intercept is not meaningful for this data because a vehicle with zero weight is not possible.



Before we can really trust the interpretation of and predictions from a regression model, there are important diagnostic checks to perform on the regression. These diagnostics are even more important when p-values or confidence intervals are used as part of the analysis. In future weeks of this course, we will focus in greater detail on the technical details of regression: hypothesis tests, confidence intervals, and diagnostic checks. However, for the sake of completeness, the following problems run through these technical details, even though we lack a full understanding of them for the time being.


Problem 3

Use your regression for the airquality data set in Problem 1 to complete the following “technical details” for this regression.


Part (a)

Create a (1) residuals vs. fitted-values plot, (2) Q-Q Plot of the residuals, and (3) residuals vs. order plot.

# Type your code here

par(mfrow=c(1,3))

plot(templm, which=1)

qqPlot(templm$residuals, main="Q-Q Plot",pch= 19, id=FALSE)

plot(templm$residuals, ylab= "Residuals", main="Residuals vs Order")


Part (b)

Explain, as best you understand currently, what each of these three plots show for this regression.

Explanation: Everything looks pretty good. The residuals vs. fitted-values plot shows constant variance and a nice linear relation. The Q-Q Plot shows possible problems with normality because some dots go out of bounds, but is fairly good. The residuals vs. order plot shows no problems with time trends, so the error terms can be assumed to be independent.


Part (c)

Report the p-value for the test of these hypotheses for your regression.

Intercept Hypotheses

\[ H_0: \beta_0 = 0 \quad \text{vs.} \quad H_a: \beta_0 \neq 0 \]

  • P-VALUE = < 2e-16

Slope Hypotheses

\[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0 \]

  • P-VALUE = 2.64e-09


Comment on whether or not we should trust each p-value based on your plots in Part (a).

Should we trust each p-value based on the plots in Part (a)?

  • Intercept Hypotheses p-value? : Yes, it can be trusted because the diagnostic plots all checked out.

  • Slope Hypotheses p-value? : Yes, it can be trusted because the diagnostic plots all checked out.



Problem 4

Use your regression for the mtcars data set in Problem 2 to complete the following “technical details” for this regression.


Part (a)

Create a (1) residuals vs. fitted-values plot, (2) Q-Q Plot of the residuals, and (3) residuals vs. order plot.

# Type your code here

par(mfrow=c(1,3))

plot(carslm, which=1)

qqPlot(carslm$residuals, main="Q-Q Plot",pch= 19, id=FALSE)

plot(carslm$residuals, ylab= "Residuals", main="Residuals vs Order")


Part (b)

Explain, as best you understand currently, what each of these three plots show for this regression.

Explanation: Everything looks somewhat questionable. The residuals vs. fitted-values plot shows a lack of linearity, which makes it hard to judge constant variance. The Q-Q Plot shows possible problems with normality because some dots go out of bounds, but is fairly good. The residuals vs. order plot shows a possible problem with time trends due to the slight rainbow pattern.


Part (c)

Report the p-value for the test of these hypotheses for your regression.

Intercept Hypotheses

\[ H_0: \beta_0 = 0 \quad \text{vs.} \quad H_a: \beta_0 \neq 0 \]

  • P-VALUE = < 2e-16


Slope Hypotheses

\[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0 \]

  • P-VALUE = 1.29e-10


Comment on whether or not we should trust each p-value based on your plots in Part (a).

Should we trust each p-value based on the plots in Part (a)?

  • Intercept Hypotheses p-value? : No, it should not be trusted because of the lack of linearity and the distance zero is from the current data.

  • Slope Hypotheses p-value? : No, it should not be fully trusted, though there is likely some sort of trend in the data, because of some problems in the diagnostic plots.



Assessment Quiz - Simple Linear Regression

  1. Run the following commands in R.

    library(car)
    View(Davis)
    ?Davis

Reduce the data to just the data for the males. Then perform a regression with weight as the response variable and height as the explanatory variable.

Which of the following provides the estimated average weight of males that are 180 cm tall?

Davey <- filter(Davis, sex == "M")

daveylm <- lm(weight ~ height, data=Davey)

summary(daveylm)
## 
## Call:
## lm(formula = weight ~ height, data = Davey)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.869  -6.893  -0.897   6.114  41.122 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -101.3301    29.8617  -3.393  0.00105 ** 
## height         0.9956     0.1676   5.939 5.92e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.07 on 86 degrees of freedom
## Multiple R-squared:  0.2908, Adjusted R-squared:  0.2826 
## F-statistic: 35.27 on 1 and 86 DF,  p-value: 5.922e-08


-101.33 +(0.9956)*180
## [1] 77.878
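predict() gives the same estimated average weight directly from the fitted model:

predict(daveylm, data.frame(height=180))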


  2. Run the following commands in R.

View(USArrests)
?USArrests

Perform a regression using this data that explains the average number of murder arrests in cities (per 100,000 in 1973) using the number of assault arrests (per 100,000) in a city.

Select the answer that provides the correct estimate of \(\beta_1\) in the formula:

\(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \text{ where } \epsilon_i \sim N(0, \sigma^2)\)

for the USArrests regression described above.

arrestlm <- lm(Murder ~ Assault, data=USArrests)

summary(arrestlm)
## 
## Call:
## lm(formula = Murder ~ Assault, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8528 -1.7456 -0.3979  1.3044  7.9256 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.631683   0.854776   0.739    0.464    
## Assault     0.041909   0.004507   9.298  2.6e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.629 on 48 degrees of freedom
## Multiple R-squared:  0.643,  Adjusted R-squared:  0.6356 
## F-statistic: 86.45 on 1 and 48 DF,  p-value: 2.596e-12
  • Estimate of \(\beta_1\): 0.041909 (the Assault coefficient)


  3. Which of the following statements is a correct statement about the graphic shown below?

Note: the “Line of Equality” shows where the line would need to be if the reported heights were equal (on average) to the actual measured heights. Dots that fall on this line show men whose reported height exactly matched their officially measured height.

Answer: The average actual height gets closer to the reported height for taller men, while shorter men seem more likely to under-report their height, on average.



Week 2: Residuals, Sums of Squares, and R-squared

Residuals

A residual is the difference between the actual and predicted value:
\[ r_i = Y_i - \hat{Y}_i \]

Sums of Squares

  • SSTO: Total variation, \(\sum_i (Y_i - \bar{Y})^2\)
  • SSE: Error (residual) variation, \(\sum_i (Y_i - \hat{Y}_i)^2\)
  • SSR: Explained variation, \(\sum_i (\hat{Y}_i - \bar{Y})^2\)
  • \(R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}\)

residuals <- resid(model)

SSTO <- sum((y - mean(y))^2)
SSE <- sum(residuals^2)
SSR <- SSTO - SSE
R2 <- SSR / SSTO

R2  # R-squared
## [1] 0.9594401
summary(model)$r.squared
## [1] 0.9594401

Skill Quiz - Residuals, Sums of Squares, and R-squared

datatable(Orange)


orangelm <- lm(circumference ~ age, data=Orange)

summary(orangelm)
## 
## Call:
## lm(formula = circumference ~ age, data = Orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.310 -14.946  -0.076  19.697  45.111 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.399650   8.622660   2.018   0.0518 .  
## age          0.106770   0.008277  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.74 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14

\[\hat{Y_i} = 17.40 + 0.106770 X_i\]

ggplot(Orange, aes(x = age, y = circumference))+
  geom_point(color = "orange") +
  geom_smooth(method = "lm", formula = y~x, se = FALSE, color = "chocolate")+
  theme_minimal()

SSE <- round(sum(residuals(orangelm)^2), 2)

SSR <- round(sum((fitted(orangelm)- mean(Orange$circumference))^2), 2)

SSTO <- round(sum((Orange$circumference - mean(Orange$circumference))^2), 2)

R2 <- round(SSR / SSTO, 2) 

correlation <- sqrt(R2)
if(coef(orangelm)[2]<0) correlation <- -correlation

correlation <- round(correlation, 2)

cat("SSE:", SSE, "\n")
## SSE: 18594.74
cat("SSR:", SSR, "\n")
## SSR: 93771.54
cat("SSTO:", SSTO, "\n")
## SSTO: 112366.3
cat("R-squared (R2):", R2, "\n")
## R-squared (R2): 0.83
cat("Correlation (r):", correlation, "\n")
## Correlation (r): 0.91
predict(orangelm, data.frame(age=1095)) # note: an interval="prediction" argument belongs to predict() itself, not inside data.frame()
##        1 
## 134.3132



datatable(mtcars)
carwtlm <- lm(mpg ~ wt, data=mtcars)

carcyllm <- lm(mpg ~ cyl, data= mtcars)

carhplm <- lm(mpg ~ hp, data=mtcars)

summary(carwtlm) %>% 
  pander()
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.29 1.878 19.86 8.242e-19
wt -5.344 0.5591 -9.559 1.294e-10
Fitting linear model: mpg ~ wt
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
32 3.046 0.7528 0.7446
summary(carcyllm) %>%
  pander()
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.88 2.074 18.27 8.369e-18
cyl -2.876 0.3224 -8.92 6.113e-10
Fitting linear model: mpg ~ cyl
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
32 3.206 0.7262 0.7171
summary(carhplm) %>%
  pander()
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.1 1.634 18.42 6.643e-18
hp -0.06823 0.01012 -6.742 1.788e-07
Fitting linear model: mpg ~ hp
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
32 3.863 0.6024 0.5892
ggplot(mtcars, aes(x=wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y~x, se = FALSE, color = "red")+
  labs(title = "Explanatory variable is wt") 

ggplot(mtcars, aes(x=cyl, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y~x, se = FALSE, color = "blue")+
  labs(title = "Explanatory variable is cyl")

ggplot(mtcars, aes(x=hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y~x, se = FALSE, color = "green")+
  labs(title = "Explanatory variable is hp") 

# Just change the lms 
SSE.wt <- round(sum(residuals(carwtlm)^2), 2)

# The response values go in the mtcars$mpg parts!
SSR.wt <- round(sum((fitted(carwtlm)- mean(mtcars$mpg))^2), 2)

SSTO.wt <- round(sum((mtcars$mpg - mean(mtcars$mpg))^2), 2)

R2.wt <- round(SSR.wt / SSTO.wt, 2) 

correlation.wt <- sqrt(R2.wt)
if(coef(carwtlm)[2]<0) correlation.wt <- -correlation.wt

correlation.wt <- round(correlation.wt, 2)

cat("SSE:", SSE.wt, "\n")
## SSE: 278.32
cat("SSR:", SSR.wt, "\n")
## SSR: 847.73
cat("SSTO:", SSTO.wt, "\n")
## SSTO: 1126.05
cat("R-squared (R2):", R2.wt, "\n")
## R-squared (R2): 0.75
cat("Correlation (r):", correlation.wt, "\n")
## Correlation (r): -0.87
SSE.cyl <- round(sum(residuals(carcyllm)^2), 2)

# The response values go in the mtcars$mpg parts!
SSR.cyl <- round(sum((fitted(carcyllm)- mean(mtcars$mpg))^2), 2)

SSTO.cyl <- round(sum((mtcars$mpg - mean(mtcars$mpg))^2), 2)

R2.cyl <- round(SSR.cyl / SSTO.cyl, 2) 

correlation.cyl <- sqrt(R2.cyl)
if(coef(carcyllm)[2]<0) correlation.cyl <- -correlation.cyl

correlation.cyl <- round(correlation.cyl, 2)

cat("SSE:", SSE.cyl, "\n")
## SSE: 308.33
cat("SSR:", SSR.cyl, "\n")
## SSR: 817.71
cat("SSTO:", SSTO.cyl, "\n")
## SSTO: 1126.05
cat("R-squared (R2):", R2.cyl, "\n")
## R-squared (R2): 0.73
cat("Correlation (r):", correlation.cyl, "\n")
## Correlation (r): -0.85
SSE.hp <- round(sum(residuals(carhplm)^2), 2)

# The response values go in the mtcars$mpg parts!
SSR.hp <- round(sum((fitted(carhplm)- mean(mtcars$mpg))^2), 2)

SSTO.hp <- round(sum((mtcars$mpg - mean(mtcars$mpg))^2), 2)

R2.hp <- round(SSR.hp / SSTO.hp, 2) 

correlation.hp <- sqrt(R2.hp)
if(coef(carhplm)[2]<0) correlation.hp <- -correlation.hp

correlation.hp <- round(correlation.hp, 2)

cat("SSE:", SSE.hp, "\n")
## SSE: 447.67
cat("SSR:", SSR.hp, "\n")
## SSR: 678.37
cat("SSTO:", SSTO.hp, "\n")
## SSTO: 1126.05
cat("R-squared (R2):", R2.hp, "\n")
## R-squared (R2): 0.6
cat("Correlation (r):", correlation.hp, "\n")
## Correlation (r): -0.77
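The three blocks above repeat the same computation for each model. A small helper function (ss_stats is a name made up here) collects the computation in one place:

ss_stats <- function(fit, y) {
  SSE  <- sum(residuals(fit)^2)          # error variation
  SSTO <- sum((y - mean(y))^2)           # total variation
  SSR  <- SSTO - SSE                     # explained variation
  R2   <- SSR / SSTO
  r    <- sign(coef(fit)[2]) * sqrt(R2)  # correlation takes the sign of the slope
  round(c(SSE = SSE, SSR = SSR, SSTO = SSTO, R2 = R2, r = r), 2)
}

ss_stats(carwtlm, mtcars$mpg) # reproduces the wt-model values above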

All together:

# Combine the statistics into a data frame
stats_table <- data.frame(
  Metric = c("SSE", "SSR", "SSTO", "R_squared", "Correlation"),
  wt = c(SSE.wt, SSR.wt, SSTO.wt, R2.wt, correlation.wt),
  cyl = c(SSE.cyl, SSR.cyl, SSTO.cyl, R2.cyl, correlation.cyl),
  hp = c(SSE.hp, SSR.hp, SSTO.hp, R2.hp, correlation.hp)
)

# Print the table
print(stats_table)
##        Metric      wt     cyl      hp
## 1         SSE  278.32  308.33  447.67
## 2         SSR  847.73  817.71  678.37
## 3        SSTO 1126.05 1126.05 1126.05
## 4   R_squared    0.75    0.73    0.60
## 5 Correlation   -0.87   -0.85   -0.77
# Optional: Format the table with nice rounding for display
kable(stats_table, caption = "Summary Statistics for mtcars dataset Regression Models")
Summary Statistics for mtcars dataset Regression Models
Metric wt cyl hp
SSE 278.32 308.33 447.67
SSR 847.73 817.71 678.37
SSTO 1126.05 1126.05 1126.05
R_squared 0.75 0.73 0.60
Correlation -0.87 -0.85 -0.77
par(mfrow=c(1,3))

plot(carwtlm, which=1)

qqPlot(carwtlm, id=FALSE, main= "Q-Q plot", col="darkred", col.lines = "red", pch = 16)

plot(carwtlm$residuals, main="Residuals vs Order")

par(mfrow=c(1,3))

plot(carcyllm, which=1)

qqPlot(carcyllm, id=FALSE, main= "Q-Q plot", col="darkblue", col.lines = "lightblue", pch = 16)

plot(carcyllm$residuals, main="Residuals vs Order")

par(mfrow=c(1,3))

plot(carhplm, which=1)

qqPlot(carhplm, id=FALSE, main= "Q-Q plot", col="darkgreen", col.lines = "green", pch = 16)

plot(carhplm$residuals, main="Residuals vs Order")


Assessment Quiz - Residuals, Sums of Squares, and R-squared

  1. A regression was performed for a sample of n = 5 data points.

The y-values of the regression are: 3.78, 6.08, 6.65, 9.25, and 9.92.

The residuals from the regression are: -0.266, 0.489, -0.486, 0.569, and -0.306.

What is the R-squared value for this regression?

y <- c(3.78, 6.08, 6.65, 9.25, 9.92)

SSTO <- sum( (y - mean(y))^2 )

#SSTO = 24.83372

res <- c(-0.266, 0.489, -0.486, 0.569, -0.306)

SSE <- sum(res^2)

#SSE = 0.96347

rr <- 1 - SSE/SSTO

print(rr)
## [1] 0.9612032


  2. Open the mtcars data set in R.

This data can be used to show that the displacement of the engine (disp) is positively correlated with the weight of the vehicle (wt).

What is the R-squared value of this regression?

mt.lm <- lm(disp ~ wt, data=mtcars)

summary(mt.lm)
## 
## Call:
## lm(formula = disp ~ wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -88.18 -33.62 -10.05  35.15 125.59 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -131.15      35.72  -3.672 0.000933 ***
## wt            112.48      10.64  10.576 1.22e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57.94 on 30 degrees of freedom
## Multiple R-squared:  0.7885, Adjusted R-squared:  0.7815 
## F-statistic: 111.8 on 1 and 30 DF,  p-value: 1.222e-11
  • Multiple R-squared: 0.7885


  3. Open the mtcars data set in R.

Perform a regression of mpg on weight of the vehicle (mpg ~ wt).

A certain statistics teacher at BYU-Idaho drives a 2001 Nissan Sentra that weighs approximately 2,700 lbs and currently gets only 21 mpg. Based on the regression you performed, how many mpg above or below average is this vehicle?

mt.lm <- lm(mpg ~ wt, data=mtcars) #perform the regression

plot(mpg ~ wt, data=mtcars) #draw the regression (not needed, but nice)

 abline(mt.lm) #add the regression line (not needed, but nice)
 
 points(2.7, 21, pch=16, col="skyblue") #add the Nissan Sentra value (not needed, but nice)

predict(mt.lm, data.frame(wt=2.7)) #get predicted value for Nissan Sentra
##        1 
## 22.85505
lines(c(2.7, 2.7), c(21, 22.85505), col="skyblue") #add residual line (not needed, but nice)

myresidual <- 21 - 22.85505 #calculate difference between Y and Y-hat

print(myresidual)
## [1] -1.85505
  • Residual: -1.85505



Week 3: Model Diagnostics and Transformations

Assumptions of Linear Regression

  1. Linear relationship
  2. Normal errors
  3. Constant variance
  4. Fixed X values
  5. Independent errors
  6. No major outliers

Diagnostic Plots

par(mfrow = c(2, 2))
plot(model)
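If only one diagnostic plot is needed, the which argument of plot.lm selects it:

plot(model, which = 1)  # residuals vs fitted values only
plot(model, which = 2)  # normal Q-Q plot of the residuals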

Transforming the Response Variable

Sometimes transforming \(Y\) helps meet assumptions:

# Combine x and y into a clean data frame
data_clean <- data.frame(x = x, y = y)
data_clean <- na.omit(data_clean)  # Drop rows with NA

# Fit the model
model <- lm(y ~ x, data = data_clean)

# Box-Cox transformation (requires a strictly positive response)
car::boxCox(model)

# Continue with log transformation (fit with a data argument so predict() works later)
model_log <- lm(log(y) ~ x, data = data_clean)

# Plotting
plot(data_clean$x, log(data_clean$y), main = "Log-Transformed Regression", pch = 19)
abline(model_log, col = "red", lwd = 2)
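Predictions from a log-scale model are on the log scale and must be back-transformed with exp(), as in Question 25 above. A minimal self-contained sketch, using a hypothetical new x value of 5:

model_log2 <- lm(log(y) ~ x, data = data_clean)
exp(predict(model_log2, data.frame(x = 5)))  # prediction on the original scale of y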



Skill Quiz - Regression Diagnostics & Transformation

Problem 1:

Open the Davis dataset in R, found in library(car). As stated in the help file for this data set, “The subjects were men and women engaged in regular exercise.”

Perform a simple linear regression of the height of the individual based on their weight.


PART (A)

Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.

\[ \underbrace{Y_i}_\text{height} = \beta_0 + \beta_1 \underbrace{X_i}_\text{weight} + \epsilon_i \quad \text{where } \epsilon_i \sim N(0, \sigma^2) \]


PART (B)

Plot a scatterplot of the data with your regression line overlaid.

davelm <- lm(height ~ weight, data= Davis)

plot(height ~ weight, data= Davis)
abline(davelm)


PART (C)

Create a residuals vs fitted-values plot for this regression. What does this plot show?

plot(davelm, which=1)


What does the plot SHOW? - The point in the bottom right corner of the plot labeled “12” (observation 12) appears to be a dramatic outlier causing the regression line to be pulled towards it.


PART (D)

State and interpret the slope, y-intercept, and \(R^2\) of this regression. Are they meaningful for this data under the current regression?

summary(davelm)
## 
## Call:
## lm(formula = height ~ weight, data = Davis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.137   -4.758    0.280    6.088   25.441 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 160.09312    3.74677  42.728  < 2e-16 ***
## weight        0.15086    0.05551   2.718  0.00715 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.82 on 198 degrees of freedom
## Multiple R-squared:  0.03597,    Adjusted R-squared:  0.0311 
## F-statistic: 7.387 on 1 and 198 DF,  p-value: 0.007152


Part Meaning
Slope = 0.15 The slope is the increase in the average height (cm) per one-kg increase in weight.
Y-intercept = 160.09 The y-intercept is the average height of exercising individuals who weigh zero kg (to put it nicely, a bunch of bologna).
\(R^2\) = 0.04 The proportion of variation in height explained by this regression (with the outlier included) is essentially zero.


Are these values meaningful for this data under the current regression?

  • No, these values are not currently very meaningful because they are being strongly skewed by the presence of the outlier (observation number 12).


PART (E)

Run View(Davis) in your Console. What do you notice about observation #12 in this data set?

  • It looks like the weight and height were accidentally reversed for this individual.

Perform a second regression for this data with observation #12 removed. Recreate the scatterplot of Part (b) with two regression lines showing this time. The first regression line should include the outlier. The second should exclude the outlier. Include a legend to show which line is which.

daniel <- filter(Davis, weight != 166) # removes observation 12, whose weight and height were swapped

dangdaniel <- lm(height ~ weight, data = daniel)

ggplot(Davis, aes(x = weight, y = height)) +
  geom_point(size = 1.5, shape = 19, color = "darkgrey", alpha = 1) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(color = "Fitted Regression (with outlier)"), size = 1.5) +
  geom_smooth(data = daniel, aes(x = weight, y = height, color = "Fitted Regression (outlier removed)"), method = "lm", formula = y ~ x, se = FALSE, size = 1.5) +
  scale_color_manual(values = c("Fitted Regression (with outlier)" = "gray", "Fitted Regression (outlier removed)" = "skyblue")) +
  labs(color= "Regression Lines") +
  theme_classic() +
  theme(legend.position = c(0.11,0.13)) +
  labs(title = "Exercising Individuals (Davis data set)", x= "Measured Weight of Individual in kg (weight)", y= "Measured Height of Individual in cm (height)")
## Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
## 3.5.0.
## ℹ Please use the `legend.position.inside` argument of `theme()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.


PART (F)

Compute the slope, y-intercept, and \(R^2\) value for the regression with the outlier removed. Compare the results to the values when the outlier was present.

summary(dangdaniel)
## 
## Call:
## lm(formula = height ~ weight, data = daniel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.347  -3.727  -0.232   3.604  20.880 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 136.83661    2.02882   67.45   <2e-16 ***
## weight        0.51689    0.03044   16.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.716 on 197 degrees of freedom
## Multiple R-squared:  0.594,  Adjusted R-squared:  0.592 
## F-statistic: 288.3 on 1 and 197 DF,  p-value: < 2.2e-16


WITHOUT outlier:

  • Slope = 0.52
  • Y-intercept = 136.84
  • \(R^2\) = 0.59

WITH outlier:

  • Slope = 0.15
  • Y-intercept = 160.09
  • \(R^2\) = 0.04


PART (G)

Create a residuals vs fitted-values plot for the regression with the outlier removed. How do things look now?

  • This plot shows possible slight problems with linearity (a slight, but consistent bend in the red line) but there do not appear to be any problems with the variance or with outliers.
par(mfrow=c(1,3))
plot(dangdaniel, which=1)



Problem 2:

Open the Prestige data set found in library(car).

Perform a regression that explains the 1971 average annual income from jobs according to their “Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.”


PART (A)

Plot the data and fitted simple linear regression line.

presto <- lm(income ~ prestige, data= Prestige)

ggplot(Prestige, aes(x = prestige, y=income)) + 
  geom_point(size = 1.5, shape = 19, color = "green", alpha = 1) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "grey")+
  labs(title= "Greater Prestige linked to Greater Income (Prestige data set)", x= "Prestige Ranking of Occupation(prestige)", y="Average Annual Income USD (income)")+
  theme_classic()

  • Note: the residual standard error (RSE) reported by summary() is the estimate of \(\sigma\).


PART (B)

State the estimated values for \(\beta_0\), \(\beta_1\), and \(\sigma\) for this regression.

summary(presto)
## 
## Call:
## lm(formula = income ~ prestige, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6693.3 -1434.9   136.5  1251.8 15152.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1465.03     860.47  -1.703   0.0918 .  
## prestige      176.43      17.26  10.224   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2984 on 100 degrees of freedom
## Multiple R-squared:  0.5111, Adjusted R-squared:  0.5062 
## F-statistic: 104.5 on 1 and 100 DF,  p-value: < 2.2e-16
  • \(\beta_0\) = -1465.03
  • \(\beta_1\) = 176.43
  • \(\sigma\) = 2984


PART (C)

Create a residuals vs fitted-values plot and a Q-Q Plot of the residuals for this regression.

par(mfrow=c(1,3))
plot(presto, which=1:2)


PART (D)

Comment on any difficulties the diagnostic plots in Part (c) reveal about the regression.

  • Normality of the error terms is violated, as shown by the points going “out of bounds” in the Q-Q Plot. However, these problems are likely due to the increasing variance of the residuals shown in the residuals vs fitted-values plot. There may even be some problems with linearity of the data because of the three outliers in the top right of the graph pulling the regression line up (shown by the red line bending down).


Comment on which estimates of Part (b) are likely affected by these difficulties.

  • The slope and the estimate of the error variance are almost certainly being affected by the increasing variance and the three outliers.



Problem 3:

Open the Burt data set from library(car).

This data set is famous for being fraudulent, or fake. See ?Burt for more details. One of the first indicators that it was fraudulent was revealed by regressing IQbio ~ IQfoster. This regression was just a little too good to be real. (Note that for social science data, like this data, \(R^2\) values above 0.3 are impressive. Values above 0.7 are rare.)


PART (A)

Plot the data and fitted regression line. State the estimated values of \(\beta_0\), \(\beta_1\), and \(\sigma\) as well as the \(R^2\) of the regression.

ernie <- lm(IQbio ~ IQfoster, data= Burt)

ggplot(Burt, aes(x= IQfoster, y = IQbio))+ 
  geom_point(size = 1.5, shape = 19, color = "wheat", alpha = 1) +
  geom_smooth(method = "lm",formula = y~x, se=FALSE, color = "burlywood")+
  labs(title= "Linked IQ? \n (Burt data set)", x="Twin IQ, Raised by Foster Parents (IQfoster)", y= "Twin IQ, Raised by Biological Parents (IQbio)")+
  theme_classic()


PART (B)

Create a (1) residuals vs. fitted-values plot, (2) Q-Q Plot of the residuals, and (3) residuals vs. order plot for this regression. Are any problems with regression assumption violations visible in these plots?

par(mfrow=c(1,3))
plot(ernie, which=1:2)
plot(ernie$residuals)


PART (C)

Comment on what the three diagnostic plots of Part (b) show for the regression.

  • They show a fairly nice regression. There may be some slight difficulties with linearity due to the curved red line in the residuals vs fitted-values plot, and points 23 and 24 are somewhat far away from the rest of the data, but otherwise, things are impressively nice.

Problem 4:

Open the mtcars data set in R.

Perform a regression of mpg explained by the displacement of the vehicle’s engine.


PART (A)

Plot the data and fitted regression line. State the estimated values of \(\beta_0\), \(\beta_1\), and \(\sigma\) as well as the \(R^2\) of the regression.

haveyouseenthecarsmovie <- lm(mpg ~ disp, data= mtcars)

ggplot(mtcars, aes(x= disp, y= mpg))+
  geom_point(size = 1.5, shape = 19, color = "limegreen", alpha = 1) +
  geom_smooth(method = "lm",formula = y~x, se = FALSE, color = "limegreen") +
  labs(title = "Reduced Fuel Efficiency \n (mtcars data set)", x= "Engine Displacement cu. in. (disp)", y= "Gas Mileage (mpg)") +
  theme_classic()

summary(haveyouseenthecarsmovie)
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8922 -2.2022 -0.9631  1.6272  7.2305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## disp        -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709 
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10
  • \(b_0\) = 29.60
  • \(b_1\) = -0.04
  • Residual Standard Error (RSE) = 3.25
  • \(R^2\) = 0.72
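The same values can be pulled out of the fitted object directly rather than read off the summary output; a quick sketch:

```r
# Extract the estimates programmatically from the fitted lm object
fit <- lm(mpg ~ disp, data = mtcars)

b0  <- coef(fit)[1]            # intercept estimate
b1  <- coef(fit)[2]            # slope estimate
rse <- sigma(fit)              # residual standard error (estimate of sigma)
r2  <- summary(fit)$r.squared  # R-squared

round(c(b0, b1, rse, r2), 2)
```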


PART (B)

Create a (1) residuals vs. fitted-values plot, (2) Q-Q Plot of the residuals, and (3) residuals vs. order plot for this regression. Are any problems with regression assumption violations visible in these plots?

par(mfrow=c(1,3))

plot(haveyouseenthecarsmovie, which=1)

qqPlot(haveyouseenthecarsmovie$residuals, main="Q-Q Plot", col.lines="blue",pch= 19, id=FALSE)

plot(haveyouseenthecarsmovie$residuals, ylab= "Residuals", main="Residuals vs Order")


PART (C)

Comment on what the three diagnostic plots of Part (b) show for the validity of the values computed in Part (a).

  • Residuals vs. Fitted plot: the clear curved pattern violates the linearity assumption, so the values computed in Part (a) (slope, intercept, RSE, and \(R^2\)) are no longer meaningful as computed.

Problem 5:

Open the Orange data set found in R.

Perform a regression that explains the circumference of the trunk of the orange tree as the tree ages.


PART (A)

Plot the data and fitted simple linear regression line.

ilikeapples <- lm(circumference ~ age, data=Orange)

ggplot(Orange, aes(x= age, y= circumference))+
  geom_point(size = 1.5, shape = 19, color = "orange", alpha = 1) +
  geom_smooth(method = "lm", formula = y~x, se= FALSE, color = "grey") +
  labs(title = "Growth of Orange Trees", x = "Age of Tree in Days", y = "Circumference of Tree (mm)")+ 
  theme_classic()


PART (B)

State the estimated values for \(\beta_0\), \(\beta_1\), and \(\sigma\) for this regression.

summary(ilikeapples)
## 
## Call:
## lm(formula = circumference ~ age, data = Orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.310 -14.946  -0.076  19.697  45.111 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.399650   8.622660   2.018   0.0518 .  
## age          0.106770   0.008277  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.74 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14


PART (C)

Create a residuals vs fitted-values plot and a Q-Q Plot of the residuals for this regression.

par(mfrow=c(1,3))

plot(ilikeapples, which=1)

qqPlot(ilikeapples$residuals, main="Q-Q Plot", col.lines="blue",pch= 19, id=FALSE)

plot(ilikeapples$residuals, ylab= "Residuals", main="Residuals vs Order")


PART (D)

Comment on any difficulties the diagnostic plots in Part (c) reveal about the regression.

Comment on which estimates of Part (b) are likely affected by these difficulties.

Difficulties:

  • Residuals vs Fitted: megaphoning! The variance of the residuals increases with the fitted values, violating the constant-variance assumption.
    • The RSE should not be considered meaningful, as it will be too large on one end of the regression and too small on the other end.


PART (E)

Perform a Box-Cox analysis of the regression. Which Y-transformation is suggested?

boxCox(ilikeapples)

  • The suggested y-transformation is \(\lambda = 0.5\), i.e. \(Y' = \sqrt{Y}\)
    • you will sqrt()!
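The suggested \(\lambda\) can also be read off programmatically; a sketch using MASS::boxcox (MASS is already loaded in the header), with plotit = FALSE so the profile grid is returned directly:

```r
library(MASS)

fit <- lm(circumference ~ age, data = Orange)

# boxcox() returns the lambda grid (x) and the profile log-likelihood (y)
bc <- boxcox(fit, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]  # lambda maximizing the log-likelihood
lambda                           # near 0.5, suggesting Y' = sqrt(Y)
```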


PART (F)

Perform a regression with the transformed y-variable. Plot the regression in the transformed units. Diagnose the fit of the regression on the transformed data.

notmyapples <- lm(sqrt(circumference) ~ age, data=Orange)
b.sqrt <- notmyapples$coef

ggplot(Orange, aes(x= age, y= sqrt(circumference)))+
  geom_point(size = 1.5, shape = 19, color = "orange", alpha = 1) +
  geom_smooth(method = "lm", formula = y~x, se= FALSE, color = "grey") +
  labs(title = "Growth of Orange Trees", x = "Age of Tree in Days", y = "Square Root of Circumference of Tree")+ 
  theme_classic()

par(mfrow=c(1,3))

plot(notmyapples, which=1)

qqPlot(notmyapples$residuals, main="Q-Q Plot", col.lines="blue",pch= 19, id=FALSE)

plot(notmyapples$residuals, ylab= "Residuals", main="Residuals vs Order")

Which of the following have actually become more problematic when comparing the diagnostic plots of the original regression to the diagnostic plots of the transformed regression?

  • Linearity is more violated in the transformation regression than it was in the original units
  • Normality is more violated in the transformed regression than it was in the original units


PART (G)

Write out the fitted model for \(\hat{Y}_i'\). Then solve the transformed model back into the original units for \(\hat{Y}_i\). Then compute the following.

When \(X_i = 500\), then \(\hat{Y}_i = \ldots\).

\[ \hat{Y}_i' = 5.3408 + 0.0055 X_i \]

\[ \sqrt{\hat{Y}_i} = 5.3408 + 0.0055 X_i \]

\[ \hat{Y}_i = (5.3408 + 0.0055 X_i)^2 \]

(5.3408 + 0.0055 * 500)^2
## [1] 65.46104
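As a sanity check (a sketch, not part of the original solution), predicting on the square-root scale with the unrounded coefficients and then squaring gives roughly the same answer:

```r
# Refit on the sqrt scale, predict at age 500, then square back to mm
fit.sqrt <- lm(sqrt(circumference) ~ age, data = Orange)
yhat <- predict(fit.sqrt, data.frame(age = 500))^2
yhat  # close to the hand computation of 65.46 (rounded coefficients were used above)
```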


PART (H)

Plot the data in the original units. Place the transformed line, back in the original units, on this plot.

plot(circumference ~ age, data=Orange, pch=16, col="orangered", main="Growth of Orange Trees", xlab="Age of Tree in Days", ylab="Circumference of Tree (mm)")

abline(ilikeapples, col= "gray")

curve( (b.sqrt[1] + b.sqrt[2]*x)^2 , add=TRUE, col="orangered")



Assessment Quiz - Diagnosing the Model and Model Transformations

  1. A regression is performed and the following plot created. Which of the following correctly shows the original scatterplot of this regression?

Residual Plot:

Original Plot:


  2. A regression is performed using \(Y_i' = Y_i^{-2}\) and the summary output below is obtained.

Select the appropriate equation for \(\hat{Y_i}\)

Answer: \(\hat{Y_i} = \frac{1}{\sqrt{830.13 - 10.93 X_i}}\)


  3. Which of these regression functions depicted on the scatterplot is the one suggested by the Box-Cox transformation?
kachow <- lm(dist ~ speed, data = cars)

boxCox(kachow)


Base R Version:

lm.scoop <- lm(sqrt(dist) ~ speed, data = cars)
b.scoop <- coef(lm.scoop)

plot(dist ~ speed, data = cars)

curve( (b.scoop[1] + b.scoop[2]*x)^2, add=TRUE, col = "yellow")
abline(kachow)


ggplot Version:

ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = "orangered") +
  stat_function(fun=function(x)(b.scoop[1] + b.scoop[2]*x)^2, color= "yellow") +
  theme_classic()



Week 4: Hypothesis Testing

Testing the Slope and Intercept

  • Use the t-test:
    \[ t = \frac{b_1 - \beta_1}{SE(b_1)} \]
b1 <- coef(model)[2]                        # model: any fitted lm object
se_b1 <- summary(model)$coefficients[2, 2]  # standard error of the slope
t_stat <- b1 / se_b1                        # tests H0: beta_1 = 0
df <- n - 2                                 # n: the number of x-y pairs
p_val <- 2 * pt(-abs(t_stat), df)

t_stat
##          x 
## -0.6161284
p_val
##         x 
## 0.5407208
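A self-contained version of the same hand computation, sketched with the built-in cars data so it can be run directly, and checked against the t value R reports in its summary table:

```r
fit <- lm(dist ~ speed, data = cars)

b1    <- coef(fit)[2]                     # slope estimate
se_b1 <- summary(fit)$coefficients[2, 2]  # its standard error
t_stat <- (b1 - 0) / se_b1                # t statistic for H0: beta_1 = 0
p_val  <- 2 * pt(-abs(t_stat), df = nrow(cars) - 2)

# R computes the identical statistic in the summary table
all.equal(unname(t_stat), summary(fit)$coefficients[2, 3])
```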



Skill Quiz - Hypothesis Test for Model Parameters

Problem 1 Install the Ecdat library in R: install.packages("Ecdat").

From library(Ecdat) open the Caschool data set in R. As stated in the help file for this data set, this data is a collection of measurements on 420 different school districts from California during the 1998-1999 school year.

The school districts in California offer a reduced-price lunch program. This is in a way, a measure of the poverty of the student body of the school district. We will assume that the higher the percentage of participants, the greater the general level of poverty. The question is, does the poverty level (or at least the percentage of participation in the reduced-lunch program) predict how well the student body will perform overall on a standardized test?

> ?Caschool

> View(Caschool)

Part (a)

Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.

\[ \underbrace{Y_i}_\text{test score} = \beta_0 + \beta_1 \underbrace{X_i}_\text{lunch percentage} + \epsilon_i \]

Part (b)

Plot a scatterplot of the data with your regression line overlaid. Write out the fitted regression equation.

library(Ecdat)

lunch <- lm(testscr ~ mealpct,data= Caschool)


plot(testscr ~ mealpct,data= Caschool)
abline(lunch)

\[ \hat{Y}_i = 681.43952 + (-0.61029)X_i \]

Part (c)

Report the test statistics and p-values for the following hypotheses.

\[ \begin{array}{l} H_0: \beta_0 = 0 \\ H_a: \beta_0 \neq 0 \\ \end{array} \quad \begin{array}{l} H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \\ \end{array} \]

summary(lunch)
## 
## Call:
## lm(formula = testscr ~ mealpct, data = Caschool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.539  -6.038  -0.647   6.245  35.543 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 681.43952    0.88943  766.16   <2e-16 ***
## mealpct      -0.61029    0.01701  -35.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.447 on 418 degrees of freedom
## Multiple R-squared:  0.7548, Adjusted R-squared:  0.7542 
## F-statistic:  1286 on 1 and 418 DF,  p-value: < 2.2e-16

The test statistics here give the number of standard errors that the estimated value of each parameter sits from its hypothesized value of zero. The intercept has t = 766.16 and the slope has t = -35.87, both with p-values < 2e-16, so both null hypotheses are rejected.

Part (d)

State the slope, y-intercept, and \(R^2\) of this regression. Further, provide 95% confidence intervals for the slope and intercept. Interpret the values.

summary(lunch)
## 
## Call:
## lm(formula = testscr ~ mealpct, data = Caschool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.539  -6.038  -0.647   6.245  35.543 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 681.43952    0.88943  766.16   <2e-16 ***
## mealpct      -0.61029    0.01701  -35.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.447 on 418 degrees of freedom
## Multiple R-squared:  0.7548, Adjusted R-squared:  0.7542 
## F-statistic:  1286 on 1 and 418 DF,  p-value: < 2.2e-16
confint(lunch, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 679.6912170 683.1878250
## mealpct      -0.6437314  -0.5768403

As shown by the \(R^2\) value, the regression is fairly good at explaining the variability in Y. This is further witnessed by how narrow the confidence intervals for the slope and intercept are.

Part (e)

Create a residuals vs fitted-values plot and Q-Q Plot of the residuals for this regression. What do these plots show?

Answer: The residuals vs fitted-values plot shows a nice linear pattern and very constant variance, with the exception of three possible outliers at points 367, 163, and 180. The Q-Q Plot shows the residuals can very safely be assumed to be normally distributed. (To prove that you made the plot, note that the dot in the top-right of the plot is labeled as point #367.)

par(mfrow=c(1,3))
plot(lunch, which= 1:2)
plot(lunch$residuals)


Problem 2

Open the Clothing data set from library(Ecdat).

Although this data is from 1990, it contains two interesting variables (1) the total tsales of the clothing stores and (2) the average number of hours worked per employee during the year, hourspw.

> ?Clothing

> View(Clothing)

Part (a)

Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.

\[ \underbrace{Y_i}_\text{Total Sales} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Hours Per Employees} + \epsilon_i \]

  • \(Y_i\) : Total Sales
  • \(X_i\) : Hours Per Employee
  • \(\beta_0\) : Average Total Sales when Zero Hours are Worked per Employee
  • \(\beta_1\) : Change in Average Total Sales as Hours Worked per Employee increases by 1
  • \(\epsilon_i\) : Each clothing company's deviation from the average Total Sales

Part (b)

Plot a scatterplot of the data with your regression line overlaid. Write out the fitted regression equation.

mahclothes <- lm(tsales ~ hourspw,data= Clothing)


plot(tsales ~ hourspw,data= Clothing)
abline(mahclothes)

\[ \hat{Y}_i = b_0 + b_1 X_i = 1745 + 43885 X_i \]

  • \(\hat{Y}_i\) : The estimated average Total Sales
  • \(X_i\) : The Hours Worked per Employee
  • \(b_0\) : The estimated average Total Sales when the Hours Worked per Employee is zero
  • \(b_1\) : The estimated change in the average Total Sales per 1-hour increase in Hours Worked per Employee
  • \(\epsilon_i\) : Not used in the fitted regression equation; it belongs only in the model for the actual \(Y_i\) values

Part (c)

Report the test statistics and p-values given by your summary(…) in R for the following hypotheses.

\[ \begin{array}{l} H_0: \beta_0 = 0 \\ H_a: \beta_0 \neq 0 \\ \end{array} \quad \begin{array}{l} H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \\ \end{array} \]

summary(mahclothes)
## 
## Call:
## lm(formula = tsales ~ hourspw, data = Clothing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -974653 -266420  -65970  131524 4721088 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1745      67479   0.026    0.979    
## hourspw        43885       3320  13.218   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 487000 on 398 degrees of freedom
## Multiple R-squared:  0.3051, Adjusted R-squared:  0.3033 
## F-statistic: 174.7 on 1 and 398 DF,  p-value: < 2.2e-16

From the summary output: for \(H_0: \beta_0 = 0\), t = 0.026 with p-value = 0.979 (fail to reject). For \(H_0: \beta_1 = 0\), t = 13.218 with p-value < 2e-16 (reject).

Part (d)

Now, use your own calculations to obtain test statistics and p-values for the following hypotheses.

You may find useful information on how to do this in the “Explanation” tab under “t Tests” from your Math 325 Notebook, Simple Linear Regression page.

\[ \begin{array}{l} H_0: \beta_0 = 1500 \\ H_a: \beta_0 \neq 1500 \\ \end{array} \quad \begin{array}{l} H_0: \beta_1 = 35000 \\ H_a: \beta_1 \neq 35000 \\ \end{array} \]

Note that these hypotheses come from previous knowledge about clothing sales and employee hours. They state that in years past, the average annual sales when no employees worked any hours on average, was 1500. And that as average employee hours worked increases by 1 hour, the average total annual sales increases by 35,000. The question now is whether the earning pattern has changed from what it used to be.

\[t = \frac{\bar{x}-\mu}{s/\sqrt{n}}\]

OR

\[t = \frac{\text{Estimate}(b_0 \text{ or } b_1) - \mu}{\text{Std. Error}(b_0 \text{ or } b_1)}\]

# Intercept
(1745 - 1500)/67479
## [1] 0.003630759
# Slope
(43885 - 35000)/3320
## [1] 2.676205
  • Intercept :
    • Test Statistic: 0.003630759
    • P-value: very large (about 0.997) because the test statistic is essentially zero
  • Slope :
    • Test Statistic: 2.676205
    • P-value: small (about 0.008) because the test statistic is fairly large
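The rough p-value judgments above can be made exact with pt(), using the 398 degrees of freedom shown in the summary output:

```r
# Exact two-sided p-values for the hand-computed t statistics (df = 398)
t_b0 <- (1745 - 1500) / 67479   # 0.003630759
t_b1 <- (43885 - 35000) / 3320  # 2.676205

p_b0 <- 2 * pt(-abs(t_b0), df = 398)  # about 0.997: fail to reject H0: beta_0 = 1500
p_b1 <- 2 * pt(-abs(t_b1), df = 398)  # about 0.008: reject H0: beta_1 = 35000
```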

Part (e)

State the slope, y-intercept, and \(R^2\) of this regression. Further, provide 95% confidence intervals for the slope and intercept. Interpret the values.

summary(mahclothes)
## 
## Call:
## lm(formula = tsales ~ hourspw, data = Clothing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -974653 -266420  -65970  131524 4721088 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1745      67479   0.026    0.979    
## hourspw        43885       3320  13.218   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 487000 on 398 degrees of freedom
## Multiple R-squared:  0.3051, Adjusted R-squared:  0.3033 
## F-statistic: 174.7 on 1 and 398 DF,  p-value: < 2.2e-16
confint(mahclothes, level= 0.95)
##                  2.5 %    97.5 %
## (Intercept) -130915.29 134404.41
## hourspw       37357.77  50411.97

Part (f)

Create a residuals vs fitted-values plot and Q-Q Plot of the residuals for this regression. What do these plots show?

They show some possible difficulties with linearity and constant variance, but more importantly, observation number 397 is an extreme outlier that should be removed. This outlier is causing the estimate of the variance of the error terms, called the MSE, to be much larger than it should be, which in turn causes the \(R^2\) value to be lower than it actually should be.

In fact, removing the outlier and re-running the regression increases the \(R^2\) value to 0.393. (Round to 3 decimal places.)

par(mfrow=c(1,3))
plot(mahclothes, which= 1:2)
plot(mahclothes$residuals)

Part (g)

Do any x-transformations or y-transformations improve the regression? If so, which ones?

The Box-Cox analysis suggests a 0.25 power transformation on Y, which is further improved by a log transformation on X. So the final plot looks like this:

boxCox(mahclothes)

plot(tsales^0.25 ~ log(hourspw), data= Clothing)



Assessment Quiz - Hypothesis Testing

Question 1: Below is the summary output from a simple linear regression performed in R. What is the value of the missing test statistic?


Answer: 3.481

  • Further Explanation: The t-value from a regression in R is found by taking the “Estimate” and dividing by the “Std. Error”. This is because R always uses the null hypothesis that the true parameter (for either the slope or the intercept) is zero. So the t-value is computed by t = (estimate - 0)/std. error = estimate/std. error
    • In this case: 4.4133 / 1.2679 = 3.480795 which rounds to 3.481.



Question 2: Below is the summary output from a simple linear regression performed in R. Compute the 95% confidence interval for the slope estimate.


Answer: (7.701, 9.423)

  • Further Explanation: Since a 95% confidence interval is obtained by the formula: estimate +/- margin of error, and the margin of error is given by (t*)(std. error) then we have:

qt(1-0.05/2, 397) #gives the critical value for t* = 1.965957

and std. error = 0.4379, as shown in the summary output.

So,

\[\underbrace{8.5621 - 1.965957 * 0.4379}_\text{lower bound}\]

\[\underbrace{8.5621 + 1.965957 * 0.4379}_\text{Upper bound}\]

If you used instead:

\[8.5621 +/- 2 * 0.4379\]

then you would have still come close to the correct answer, but it would not be as correct as using the actual critical value from the t distribution with 397 degrees of freedom.
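The whole computation in one runnable chunk:

```r
# 95% CI for the slope: estimate +/- t* x std. error, with df = 397
est <- 8.5621
se  <- 0.4379
tstar <- qt(1 - 0.05/2, df = 397)

c(lower = est - tstar * se, upper = est + tstar * se)  # (7.701, 9.423)
```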



Question 3: Suppose a 95% confidence interval for the true regression slope is obtained as (-7.453, -0.947) from a sample of 100 x-y points. What is the p-value for the test of the following hypotheses?

\[H_0 : \beta_1 = -10 \\ H_a : \beta_1 \neq -10\]


Answer: 0.0006175379

  • Further explanation: The margin of error can be found from the confidence interval by using (-0.947 - (-7.453))/2 = 3.253. Similarly, the estimate of the slope is the middle of the confidence interval: (-7.453 + (-0.947))/2 = -4.2.
    • The critical value behind the margin of error is qt(1-0.05/2, 100-2) = 1.984467.
    • This allows us to recover the standard error of the slope: margin of error / critical value = 3.253/1.984467 = 1.639231.
    • Thus, t = (estimate - hypothesized)/std. error = (-4.2 - (-10))/1.639231 = 3.538244, and the corresponding p-value is clearly very small, but can be computed exactly by 2*pt(-3.538244, 98), which gives p-value = 0.0006175379.
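Carrying out those steps in R reproduces the stated p-value:

```r
# Recover the slope estimate and standard error from the 95% CI, then test H0: beta_1 = -10
est <- (-7.453 + -0.947) / 2         # midpoint of the CI: -4.2
me  <- (-0.947 - -7.453) / 2         # margin of error: 3.253
se  <- me / qt(1 - 0.05/2, df = 98)  # standard error of the slope
t   <- (est - (-10)) / se            # test statistic
p   <- 2 * pt(-abs(t), df = 98)      # two-sided p-value, about 0.00062
```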




Week 5: Confidence and Prediction Intervals

Confidence vs Prediction

  • Confidence Interval: Mean response at a given \(X_h\)
  • Prediction Interval: Individual new response at \(X_h\)
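The distinction is easy to see with predict(); a sketch on the built-in cars data — at the same \(X_h\), the two intervals share the fitted value, but the prediction interval is always wider:

```r
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 15)

ci <- predict(fit, new, interval = "confidence")  # mean stopping distance at speed 15
pi <- predict(fit, new, interval = "prediction")  # one new car's stopping distance at speed 15

# Same fitted value, but the prediction interval is wider
(pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"])
```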



Skill Quiz - Confidence and Prediction Intervals

Problem 1

Install the alr3 library in R: install.packages("alr3").

From library(alr3) open the BGSall data set in R. As stated in the help file for this data set, this data is a collection of measurements on “children born in 1928-29 in Berkeley, CA.”

> ?BGSall

> View(BGSall)

A standing tradition is that if you measure a child when they are 2-years old, and double their height, this will give a good prediction on their final adult height. Let’s see if this is true.

Perform a regression that could predict a child’s 18-year old height from their 2-year old height.

Part (a)

Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.

\[ \underbrace{Y_i}_\text{Height 18} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Height 2} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0, \sigma^2) \]

Part (b)

Plot a scatterplot of the data with your regression line overlaid. Write out the fitted regression equation.

tall <- lm(HT18 ~ HT2, data= BGSall)

plot(HT18 ~ HT2, data= BGSall)
abline(tall)

\[ \hat{Y}_i = 45.7966 + 1.4441 X_i \]

Part (c)

Report the test statistics and p-values for the following hypotheses. (The hypotheses about \(\beta_0\) claim that at age 0-years, a child should have height 0 cm. The hypotheses about \(\beta_1\) claim that height doubles from age 2 to 18.)

\[ \begin{array}{l} H_0: \beta_0 = 0 \\ H_a: \beta_0 \neq 0 \\ \end{array} \quad \begin{array}{l} H_0: \beta_1 = 2 \\ H_a: \beta_1 \neq 2 \\ \end{array} \]

summary(tall)
## 
## Call:
## lm(formula = HT18 ~ HT2, data = BGSall)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5406  -5.9807  -0.3401   5.7771  17.3163 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.7966    16.7006   2.742  0.00694 ** 
## HT2           1.4441     0.1901   7.597 4.67e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.422 on 134 degrees of freedom
## Multiple R-squared:  0.301,  Adjusted R-squared:  0.2958 
## F-statistic: 57.71 on 1 and 134 DF,  p-value: 4.67e-12

For \(H_0: \beta_0 = 0\): the summary gives t = 2.742 with p-value = 0.00694, so we reject the claim that the intercept is zero. For \(H_0: \beta_1 = 2\): the summary tests \(\beta_1 = 0\), so compute by hand t = (1.4441 - 2)/0.1901 = -2.924; with 134 degrees of freedom the two-sided p-value is about 0.004, so we also reject the claim that height doubles from age 2 to 18.

Part (d)

State the slope, y-intercept, and \(R^2\) of this regression. Further, provide 95% confidence intervals for the slope and intercept. Interpret the values.

summary(tall)
## 
## Call:
## lm(formula = HT18 ~ HT2, data = BGSall)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5406  -5.9807  -0.3401   5.7771  17.3163 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.7966    16.7006   2.742  0.00694 ** 
## HT2           1.4441     0.1901   7.597 4.67e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.422 on 134 degrees of freedom
## Multiple R-squared:  0.301,  Adjusted R-squared:  0.2958 
## F-statistic: 57.71 on 1 and 134 DF,  p-value: 4.67e-12
confint(tall, level=0.95)
##                 2.5 %    97.5 %
## (Intercept) 12.765715 78.827522
## HT2          1.068108  1.820011
  • \(R^2 = 0.301\): confirms what we see in the scatterplot, that even though the regression is useful, there is still quite a bit of variability in the data left unexplained.

Part (e)

Create a residuals vs fitted-values plot and Q-Q Plot of the residuals for this regression. What do these plots show?

par(mfrow=c(1,3))
plot(tall, which= 1:2)

Part (f)

A certain stats teacher at BYU-Idaho has a son named Andrew who turns 2-years old in a couple of weeks. He is currently 2 feet 9 inches tall. Predict his 18-year old height in centimeters with 95% confidence.

predict(tall, data.frame(HT2=83.82), interval="prediction")
##        fit      lwr     upr
## 1 166.8377 152.0294 181.646
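The HT2 = 83.82 comes from converting Andrew's height: 2 feet 9 inches is 33 inches, and an inch is 2.54 cm:

```r
# Convert 2 ft 9 in to centimeters for the HT2 predictor
ht2_cm <- (2 * 12 + 9) * 2.54
ht2_cm  # 83.82
```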

Problem 2

Open the wblake data set from library(alr4).

> ?wblake

> View(wblake)

If you love fishing, then you might like this data. The goal of this data was to see if there was a link in the radius size of a key “Scale” of a fish and the overall “Length” of the fish.

Part (a)

Type out the mathematical equation for this regression model and label both \(Y\) and \(X\) in the equation.

\[ \underbrace{Y_i}_\text{Scale} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Length} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0, \sigma^2) \]

Part (b)

Plot a scatterplot of the data with your regression line overlaid. Write out the fitted regression equation. State the \(R^2\) value of the regression.

fishy.lm <- lm(Scale ~ Length, data = wblake)

summary(fishy.lm)
## 
## Call:
## lm(formula = Scale ~ Length, data = wblake)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9240 -0.5011 -0.1639  0.1910  3.9795 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.4307592  0.1356603  -10.55   <2e-16 ***
## Length       0.0378026  0.0006644   56.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9288 on 437 degrees of freedom
## Multiple R-squared:  0.8811, Adjusted R-squared:  0.8808 
## F-statistic:  3237 on 1 and 437 DF,  p-value: < 2.2e-16

\[ \hat{Y}_i = -1.4308 + 0.0378 X_i, \quad R^2 = 0.8811 \]

plot(Scale ~ Length, data = wblake)
abline(fishy.lm)

Part (c)

Diagnose the regression with a residuals vs. fitted-values plot. Determine which Y-transformation is suggested for this data.

par(mfrow=c(1,2))

plot(fishy.lm, which=1)

qqPlot(fishy.lm$residuals, main="Q-Q Plot", col="darkolivegreen", col.lines="darkgreen",pch= 19, id=FALSE)

boxCox(fishy.lm)

Part (d)

Perform a regression of the form \(Y' = Y^\lambda\). Use your answer to Part (c) to select \(\lambda\).

Plot the regression in the transformed space, \(Y' \sim X\) and add the fitted regression to the plot.

plot(sqrt(Scale) ~ Length, data = wblake)

abline(lm(sqrt(Scale) ~ Length, data = wblake), col = "skyblue", lwd = 2)

Part (e)

State the slope, y-intercept, and \(R^2\) of this transformed regression.

fisher.lm <- lm(sqrt(Scale) ~ Length, data = wblake)

summary(fisher.lm)
## 
## Call:
## lm(formula = sqrt(Scale) ~ Length, data = wblake)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55799 -0.11064 -0.03639  0.05285  0.62613 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.7881700  0.0263782   29.88   <2e-16 ***
## Length      0.0081115  0.0001292   62.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1806 on 437 degrees of freedom
## Multiple R-squared:  0.9002, Adjusted R-squared:    0.9 
## F-statistic:  3942 on 1 and 437 DF,  p-value: < 2.2e-16

The y-intercept is 0.7882, the slope is 0.0081, and \(R^2 = 0.9002\).

Part (f)

Create a residuals vs fitted-values plot and Q-Q Plot of the residuals for this transformed regression. Does this regression look better than the original?

par(mfrow=c(1,2))

plot(fisher.lm, which=1)

qqPlot(fisher.lm$residuals, main="Q-Q Plot", col="darkolivegreen", col.lines="darkgreen",pch= 19, id=FALSE)

Part (g)

Untransform the fitted regression equation and draw it on a scatterplot of the original data. Include the original regression line on this plot.

b <- coef(fishy.lm)
b2 <- coef(fisher.lm)

plot(Scale ~ Length, data = wblake, pch = 19)

# Original regression line
curve(b[1] + b[2]*x, col = "skyblue", lwd = 2, add = TRUE)

# Untransformed curve: square both sides of sqrt(Y-hat) = b2[1] + b2[2]*X
curve((b2[1] + b2[2]*x)^2, col = "firebrick", lwd = 2, add = TRUE)

Part (h)

Place two prediction intervals for the Scale radius when the Length of the fish is 250 on your scatterplot of the data in the original units.

  1. Show the prediction interval from the original regression in the original units.

  2. Show the prediction interval from the transformed regression back on the original units.

predict(fishy.lm,data.frame(Length = 250), interval="prediction")
##        fit     lwr      upr
## 1 8.019889 6.19091 9.848869
predict(fisher.lm,data.frame(Length = 250), interval="prediction")^2
##       fit      lwr     upr
## 1 7.93008 6.053607 10.0595

Week 6: Qualitative Predictors and Interaction Terms

Dummy Variables and the “Two Lines” Model

We can include categorical variables in a regression model using dummy (0/1) variables.

set.seed(42)
group <- rep(c("A", "B"), each = 50)
x <- c(rnorm(50, 5, 1), rnorm(50, 5, 1))
y <- c(2 + 3 * x[1:50] + rnorm(50), 4 + 1.5 * x[51:100] + rnorm(50))
data <- data.frame(x, group, y)

model_two_lines <- lm(y ~ x * group, data = data)
summary(model_two_lines)
## 
## Call:
## lm(formula = y ~ x * group, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.72823 -0.55004  0.00611  0.45243  3.07342 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.4315     0.5764   2.484  0.01473 *  
## x             3.0840     0.1132  27.255  < 2e-16 ***
## groupB        2.9073     0.9301   3.126  0.00235 ** 
## x:groupB     -1.6551     0.1807  -9.160 9.47e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9121 on 96 degrees of freedom
## Multiple R-squared:  0.9444, Adjusted R-squared:  0.9427 
## F-statistic:   544 on 3 and 96 DF,  p-value: < 2.2e-16
library(ggplot2)
ggplot(data, aes(x = x, y = y, color = group)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
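The four coefficients can be unpacked into the two fitted lines. A minimal sketch (re-creating the simulated data above so it runs on its own):

```r
# Re-create the simulated data above, then unpack the two fitted lines
set.seed(42)
group <- rep(c("A", "B"), each = 50)
x <- c(rnorm(50, 5, 1), rnorm(50, 5, 1))
y <- c(2 + 3 * x[1:50] + rnorm(50), 4 + 1.5 * x[51:100] + rnorm(50))
data <- data.frame(x, group, y)

b <- coef(lm(y ~ x * group, data = data))

# Group A (baseline): intercept b[1], slope b[2]
lineA <- c(intercept = unname(b[1]), slope = unname(b[2]))
# Group B: the dummy shifts the intercept by b[3] and the slope by b[4]
lineB <- c(intercept = unname(b[1] + b[3]), slope = unname(b[2] + b[4]))

lineA  # should land near the true values (2, 3)
lineB  # should land near the true values (4, 1.5)
```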



Skill Quiz - Confidence and Prediction Intervals

Problem 1

Consider the scatterplot shown here of a single residence’s monthly gas bill according to the month of the year. See ?Utilities for more details on the data.

plot(gasbill ~ month, data=Utilities, main="Single Residence in Minnesota", xlab="Month of the Year", ylab="Monthly Gas Bill (US Dollars)")

Part (a)

Add the estimated regression function to the scatterplot above and state the function below.

  • Which of the following regression models seem most appropriate to fit to this data?
    • A Quadratic Regression Model
lm.gassy <- lm(gasbill ~ month + I(month^2), data=Utilities)

summary(lm.gassy)
## 
## Call:
## lm(formula = gasbill ~ month + I(month^2), data = Utilities)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -117.440  -15.127   -0.692   10.528   86.617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 284.4558    10.5374   27.00   <2e-16 ***
## month       -76.6165     3.6609  -20.93   <2e-16 ***
## I(month^2)    5.4807     0.2712   20.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.48 on 114 degrees of freedom
## Multiple R-squared:  0.7937, Adjusted R-squared:  0.7901 
## F-statistic: 219.3 on 2 and 114 DF,  p-value: < 2.2e-16
b <- coef(lm.gassy)

plot(gasbill ~ month, data=Utilities, main="Single Residence in Minnesota", xlab="Month of the Year", ylab="Monthly Gas Bill (US Dollars)")

curve(b[1] + b[2]*x + b[3]*x^2, col="skyblue", lwd=2, add=TRUE)

\[ \hat{Y}_i = 284.4558 - 76.6165 X_i + 5.4807 X_i^2 \]

Part (b)

Diagnose the appropriateness of this regression model. How well does it fit the data?

Be sure to provide diagnostic plots and supporting arguments for your claims.

par(mfrow=c(1,3))

plot(lm.gassy, which=1)

qqPlot(lm.gassy, main= "Q-Q plot", pch = 16)
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## [1] 2 4
plot(lm.gassy$residuals, main="Residuals vs Order")

Part (c)

What range of possible gas bill amounts do you predict for the September bill? How confident are you in your prediction?

predict(lm.gassy, data.frame(month = 9),interval = "prediction")
##        fit       lwr      upr
## 1 38.84658 -22.00324 99.69641

A 95% prediction interval for the September bill runs from about -$22.00 to $99.70. Since a negative gas bill is not possible, a realistic range is roughly $0 to $100, with a predicted bill of about $38.85. The interval is quite wide, so the prediction is not very precise, though we are 95% confident it captures the actual September bill.

Part (d)

Compute the mean of just September’s gas bills and draw a horizontal reference line for this mean on your scatterplot. How does this mean compare to your prediction in Part (c)?

sepbill <- Utilities %>%
  filter(month == 9) 

mean(sepbill$gasbill, na.rm=TRUE)
## [1] 22.815

The mean September gas bill is $22.82. This falls inside the prediction interval from Part (c), but is noticeably below the predicted value of $38.85.

Part (e)

State the values of \(R^2\), \(MSE\), and the residual standard error. What do each of these values tell us about the regression model?

summary(lm.gassy)
## 
## Call:
## lm(formula = gasbill ~ month + I(month^2), data = Utilities)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -117.440  -15.127   -0.692   10.528   86.617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 284.4558    10.5374   27.00   <2e-16 ***
## month       -76.6165     3.6609  -20.93   <2e-16 ***
## I(month^2)    5.4807     0.2712   20.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.48 on 114 degrees of freedom
## Multiple R-squared:  0.7937, Adjusted R-squared:  0.7901 
## F-statistic: 219.3 on 2 and 114 DF,  p-value: < 2.2e-16

\(R^2 = 0.7937\): about 79% of the variation in the gas bills is explained by the quadratic model. The residual standard error is 30.48, meaning a typical monthly bill misses the regression curve by about $30. The MSE is the square of the residual standard error, \(30.48^2 \approx 929\), which estimates the variance of the residuals.


Problem 2

View the mtcars dataset and corresponding help file ?mtcars.

Perform a regression that predicts the miles per gallon mpg of the vehicle based on the quarter mile time qsec and transmission type am of the vehicle.

Part (a)

Plot the data and your fitted regression model.

vroom.lm <- lm(mpg ~ qsec, data=mtcars)

b <- coef(vroom.lm)

ggplot(mtcars, aes(y=mpg, x=qsec, color=factor(am))) + 
  geom_point(size = 1) +
  geom_smooth(method = "lm", aes(group = as.factor(am)), se= FALSE) +
  scale_color_manual(name = "am", values = c("skyblue", "orange"))+ theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

Part (b)

State the fitted regression model.

vroom.lm <- lm(mpg ~ qsec + am + qsec:am, data = mtcars)

summary(vroom.lm)
## 
## Call:
## lm(formula = mpg ~ qsec + am + qsec:am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4551 -1.4331  0.1918  2.2493  7.2773 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -9.0099     8.2179  -1.096  0.28226   
## qsec          1.4385     0.4500   3.197  0.00343 **
## am          -14.5107    12.4812  -1.163  0.25481   
## qsec:am       1.3214     0.7017   1.883  0.07012 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.343 on 28 degrees of freedom
## Multiple R-squared:  0.722,  Adjusted R-squared:  0.6923 
## F-statistic: 24.24 on 3 and 28 DF,  p-value: 6.129e-08

\[ \hat{Y}_i = -9.0099 + 1.4385 \underbrace{X_{1i}}_\text{qsec} - 14.5107 \underbrace{X_{2i}}_\text{am} + 1.3214 X_{1i}X_{2i} \]

Part (c)

Use summary(...) to perform an appropriate t test to determine if the interaction term is needed in this regression model.

par(mfrow=c(1,3))

plot(vroom.lm, which=1)

qqPlot(vroom.lm, main= "Q-Q plot", pch = 16)
## Lotus Europa   Volvo 142E 
##           28           32
plot(vroom.lm$residuals, main="Residuals vs Order")

From the summary(vroom.lm) output in Part (b), the interaction term qsec:am has a p-value of 0.07012. At the 0.05 level this is not significant, so there is insufficient evidence that the two transmission types need different slopes.

Part (d)

Diagnose the appropriateness of this regression model. How well does it fit the data?

Be sure to provide diagnostic plots and supporting arguments for your claims.

par(mfrow=c(1,3))

plot(vroom.lm, which=1)

qqPlot(vroom.lm$residuals, main="Q-Q Plot", pch = 16, id=FALSE)

plot(vroom.lm$residuals, main="Residuals vs Order")

Type your answer here…

Part (e)

State the \(R^2\) and residual standard error values for this model. What do these values show?

From the summary(vroom.lm) output in Part (b), \(R^2 = 0.722\) and the residual standard error is 3.343. About 72% of the variation in mpg is explained by the model, and a typical vehicle's mpg misses its fitted line by about 3.3 mpg.


Part (f)

Sometimes, when an interaction term is not significant, we may choose to use a reduced regression model that drops the interaction term. Mathematically this turns the full two-lines model into this simpler model:

\[ \underbrace{Y_i}_\text{mpg} = \beta_0 + \beta_1 \underbrace{X_{1i}}_\text{qsec} + \beta_2 \underbrace{X_{2i}}_\text{am} + \epsilon_i \]

Note the absence of the interaction term \(\beta_3 X_{1i}X_{2i}\) in this reduced two-lines model.

Fit this reduced model on the mtcars data from above. State the fitted regression model and plot the model.

vroomvroom.lm <- lm(mpg ~ qsec + am, data = mtcars)

summary(vroomvroom.lm)
## 
## Call:
## lm(formula = mpg ~ qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3447 -2.7699  0.2938  2.0947  6.9194 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.8893     6.5970  -2.863  0.00771 ** 
## qsec          1.9819     0.3601   5.503 6.27e-06 ***
## am            8.8763     1.2897   6.883 1.46e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.487 on 29 degrees of freedom
## Multiple R-squared:  0.6868, Adjusted R-squared:  0.6652 
## F-statistic:  31.8 on 2 and 29 DF,  p-value: 4.882e-08

(Note that the interaction term is sometimes called the “change in slope” term, and thus, once it is removed from the model, there is no longer any possibility for the two lines to have different slopes. They are forced to be parallel.)

b <- coef(vroomvroom.lm)

ggplot(mtcars, aes(y=mpg, x=qsec, color=factor(am))) + 
  geom_point(size = 1) +
  geom_smooth(method = "lm", aes(group = as.factor(am)), se= FALSE) +
  scale_color_manual(name = "am", values = c("coral", "aquamarine3"))+ theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

Problem 3

View the mtcars dataset and corresponding help file ?mtcars.

Perform a regression that predicts the quarter mile time qsec of the vehicle based on the displacement of the engine disp and transmission type am of the vehicle according to the model:

\[ \underbrace{Y_i}_\text{qsec} = \beta_0 + \beta_1 \underbrace{X_{1i}}_\text{disp} + \beta_2 X_{1i}^2 + \beta_3 \underbrace{X_{2i}}_\text{am == 1} + \beta_4 X_{1i}X_{2i} + \beta_5 X_{1i}^2X_{2i} + \epsilon_i \]

where \(\epsilon_i \sim N(0, \sigma^2)\).

Part (a)

Plot the data and your fitted regression model.

mtcars$disp2 <- mtcars$disp^2

drivin.lm <- lm(qsec ~ disp + disp2 + am + disp:am + disp2:am, data = mtcars)

summary(drivin.lm)  
## 
## Call:
## lm(formula = qsec ~ disp + disp2 + am + disp:am + disp2:am, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.59239 -0.58978  0.02637  0.43534  2.35282 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.619e+01  1.728e+00  15.155 2.02e-14 ***
## disp        -4.937e-02  1.270e-02  -3.888 0.000626 ***
## disp2        6.607e-05  2.133e-05   3.097 0.004646 ** 
## am          -3.663e+00  2.343e+00  -1.563 0.130076    
## disp:am     -2.926e-03  2.335e-02  -0.125 0.901222    
## disp2:am     1.866e-05  5.143e-05   0.363 0.719707    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.033 on 26 degrees of freedom
## Multiple R-squared:  0.7197, Adjusted R-squared:  0.6658 
## F-statistic: 13.35 on 5 and 26 DF,  p-value: 1.695e-06
b <- coef(drivin.lm)

ggplot(mtcars, aes(x = disp, y = qsec, color = factor(am))) +
  geom_point(size = 2) +
  stat_function( fun = function(x) b[1] + b[2]*x + b[3]*x^2, color= "skyblue") + 
  stat_function( fun = function(x) (b[1] + b[4]) + (b[2] + b[5])*x + (b[3] + b[6])*x^2, color= "orange")+
  scale_color_manual(name = "Transmission Type", values = c("skyblue", "orange"),
                     labels = c("Automatic", "Manual")) +
  theme_classic() +
  labs(title = "Regression of qsec on disp and Transmission Type",
       x = "Displacement (disp)",
       y = "Quarter Mile Time (qsec)")

Type your answer here…

Part (b)

State the fitted regression model.

\[ \hat{Y}_i = 26.19 - 0.04937 \underbrace{X_{1i}}_\text{disp} + 0.00006607 X_{1i}^2 - 3.663 \underbrace{X_{2i}}_\text{am} - 0.002926 X_{1i}X_{2i} + 0.00001866 X_{1i}^2X_{2i} \]

Part (c)

Use summary(...) to perform appropriate t tests to determine which interaction terms are needed in this regression model.

From the summary(drivin.lm) output in Part (a), the interaction terms disp:am (p = 0.901) and disp2:am (p = 0.720) are both far from significant, so neither interaction appears to be needed in this model.

Part (d)

Diagnose the appropriateness of this regression model. How well does it fit the data?

Be sure to provide diagnostic plots and supporting arguments for your claims.

drivin.lm <- lm(qsec ~ disp + disp2 + am + disp:am + disp2:am, data = mtcars)

par(mfrow=c(1,3))

plot(drivin.lm, which=1)

qqPlot(drivin.lm, main= "Q-Q plot", pch = 16)
##  Valiant Merc 230 
##        6        9
plot(drivin.lm$residuals, main="Residuals vs Order")

Type your answer here…

Part (e)

Drop the “least significant” term from the model. This is a silly statement really, but is an accepted practice when searching for a “best” model in regression analysis. What has changed?

WITH “least significant term”

mtcars$disp2 <- mtcars$disp^2

drivin.lm <- lm(qsec ~ disp + disp2 + am + disp:am + disp2:am, data = mtcars)

summary(drivin.lm)
## 
## Call:
## lm(formula = qsec ~ disp + disp2 + am + disp:am + disp2:am, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.59239 -0.58978  0.02637  0.43534  2.35282 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.619e+01  1.728e+00  15.155 2.02e-14 ***
## disp        -4.937e-02  1.270e-02  -3.888 0.000626 ***
## disp2        6.607e-05  2.133e-05   3.097 0.004646 ** 
## am          -3.663e+00  2.343e+00  -1.563 0.130076    
## disp:am     -2.926e-03  2.335e-02  -0.125 0.901222    
## disp2:am     1.866e-05  5.143e-05   0.363 0.719707    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.033 on 26 degrees of freedom
## Multiple R-squared:  0.7197, Adjusted R-squared:  0.6658 
## F-statistic: 13.35 on 5 and 26 DF,  p-value: 1.695e-06

WITHOUT “least significant term”

mtcars$disp2 <- mtcars$disp^2

swervin.lm <- lm(qsec ~ disp + disp2 + am + disp2:am, data = mtcars)

summary(swervin.lm)
## 
## Call:
## lm(formula = qsec ~ disp + disp2 + am + disp2:am, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.57899 -0.56846  0.03619  0.45453  2.33218 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.630e+01  1.442e+00  18.246  < 2e-16 ***
## disp        -5.024e-02  1.046e-02  -4.803 5.18e-05 ***
## disp2        6.750e-05  1.768e-05   3.817 0.000717 ***
## am          -3.939e+00  7.840e-01  -5.024 2.86e-05 ***
## disp2:am     1.238e-05  1.143e-05   1.083 0.288546    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.014 on 27 degrees of freedom
## Multiple R-squared:  0.7196, Adjusted R-squared:  0.678 
## F-statistic: 17.32 on 4 and 27 DF,  p-value: 3.766e-07

Dropping disp:am (the least significant term, p = 0.901) barely changes the fit: \(R^2\) goes from 0.7197 to 0.7196, while the adjusted \(R^2\) improves from 0.6658 to 0.678 and the residual standard error drops from 1.033 to 1.014. The am term is now strongly significant, but disp2:am remains insignificant (p = 0.289), suggesting it could be dropped next.

Problem 4

set.seed(101) 

n <- 100 

X_i <- runif(n, -2, 2)
X_2i <- sample(c(0,1), n, replace=TRUE)


beta0 <- 2
beta1 <- 5
beta2 <- -3
beta3<- -4
beta4<- -2
beta5<- 7

sigma <- 2

epsilon_i <- rnorm(n, 0, sigma) 

  
Y_i <- beta0 + beta1*X_i + beta2*(X_i)^2 + beta3*X_2i + beta4*X_i*X_2i + beta5*(X_i)^2*X_2i + epsilon_i

fabData <- data.frame(x = X_i, y = Y_i, x2 = X_2i)


fab.lm <- lm(y ~ x + I(x^2) + x2 + x:x2 + I(x^2):x2, data = fabData)

b <- coef(fab.lm)


# betas are the solid, the coefficients from the lm are the dashed
ggplot(fabData, aes(x = x, y = y, color = as.factor(x2))) +
  geom_point(size = 1, pch=19) +
  scale_color_manual(name = "Groups", values = c("firebrick", "midnightblue")) +
  stat_function(fun = function(x) beta0 + beta1*x + beta2*x^2, color = "firebrick", linetype = "solid", size = 1) + 
  stat_function(fun = function(x) (beta3 + beta0) + (beta4 + beta1)*x + (beta5 + beta2)*x^2, color = "midnightblue", linetype = "solid", size = 1) +
  stat_function(fun = function(x) b[1] + b[2]*x + b[3]*x^2, color = "firebrick", linetype = "dashed", size = 1) + 
  stat_function(fun = function(x) (b[1] + b[4]) + (b[2] + b[5])*x + (b[3] + b[6])*x^2, color = "midnightblue", linetype = "dashed", size = 1) +
  theme_minimal() +
  labs(title = "Quadratic Regression for Two Models",
       x = "X Values",
       y = "Y Values")
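
To see numerically how close the estimates come to the true coefficients, the simulation can be re-run and compared side by side. This sketch is self-contained and reproduces the same seed and sampling order as the chunk above:

```r
# Self-contained re-run of the simulation above (same seed, same sampling order),
# followed by a side-by-side comparison of true and estimated coefficients
set.seed(101)
n <- 100
X_i  <- runif(n, -2, 2)
X_2i <- sample(c(0, 1), n, replace = TRUE)

betas <- c(2, 5, -3, -4, -2, 7)   # beta0 .. beta5 from above
sigma <- 2

Y_i <- betas[1] + betas[2]*X_i + betas[3]*X_i^2 + betas[4]*X_2i +
  betas[5]*X_i*X_2i + betas[6]*X_i^2*X_2i + rnorm(n, 0, sigma)

fabData <- data.frame(x = X_i, y = Y_i, x2 = X_2i)
fab.lm  <- lm(y ~ x + I(x^2) + x2 + x:x2 + I(x^2):x2, data = fabData)

# lm() happens to report the coefficients in the same order as beta0..beta5 here
round(rbind(true = betas, estimated = unname(coef(fab.lm))), 2)
```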



Assessment Quiz - Hypothesis Testing

Question 1: What is the equation of the regression function shown in this plot?

## Code to make this graphic...

ggplot(Wages1, aes(x=school, y=wage)) + geom_point(color=rgb(.3,.3,.2,.4)) + geom_smooth(method="lm", formula= y ~ poly(x, 2, raw=TRUE), se=F, col=rgb(.3,.3,.2,.8)) + theme_bw() + labs(title="Best wages for highest schooling... or no schooling?", subtitle="Wages1 data, library(Ecdat)", x="Number of years of Schooling (school)", y="Current Hourly Wage in 1980's USD (wage)")

## Code needed to obtain the solution...

summary(lm(wage ~ school + I(school^2), data=Wages1))
## 
## Call:
## lm(formula = wage ~ school + I(school^2), data = Wages1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.337 -1.998 -0.474  1.444 34.554 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.13334    1.46274   4.193 2.82e-05 ***
## school      -0.68354    0.25743  -2.655  0.00796 ** 
## I(school^2)  0.05488    0.01129   4.859 1.23e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.126 on 3291 degrees of freedom
## Multiple R-squared:  0.08636,    Adjusted R-squared:  0.0858 
## F-statistic: 155.5 on 2 and 3291 DF,  p-value: < 2.2e-16
Thus the regression function shown in the plot is \(\hat{Y}_i = 6.1333 - 0.6835 X_i + 0.0549 X_i^2\).

## Code to draw the plot in base graphics...

plot(wage ~ school, data=Wages1, pch=16, col=rgb(.3,.3,.2,.4), main="Best wages for highest schooling... or no schooling?",  xlab="Number of years of Schooling (school)", ylab="Current Hourly Wage in 1980's USD (wage)")

mtext(side=3, text="Wages1 data, library(Ecdat)")

abline(v=seq(2,16,2), h=seq(0,40,10), lty=2, col=rgb(.2,.2,.2,.2))

wages.lm <- lm(wage ~ school + I(school^2), data=Wages1)

b <- coef(wages.lm)

curve(b[1] + b[2]*x + b[3]*x^2, add=TRUE, col=rgb(.3,.3,.2,.8), lwd=2)


Question 2: Which of the following graphics correctly depicts the regression model stated here.

\[Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i \text{ where } X_{2i} = \left\{\begin{array}{ll} 1, & \text{Blue Group} \\ 0, & \text{Orange Group} \end{array}\right. \]

Since the equation only allows the y-intercept to be changed by the \(\beta_2\) term, and there is no interaction term allowing the slope to change, then the correct answer is the plot that has lines with equal slopes. In other words, this model forces parallel lines. Only models that include an interaction term can have differing slopes.
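
As an illustration (hypothetical simulated data, not from the quiz), fitting a model with a dummy variable but no interaction term forces the two fitted lines to share one slope:

```r
# Hypothetical illustration: a dummy variable with no interaction term
# shifts only the intercept, producing parallel lines.
set.seed(7)
g <- rep(c(0, 1), each = 30)     # 0 = Orange Group, 1 = Blue Group
x <- runif(60, 0, 10)
y <- 1 + 2*x + 5*g + rnorm(60)   # true model: parallel lines

par.lm <- lm(y ~ x + g)
b <- coef(par.lm)

plot(y ~ x, col = c("orange", "blue")[g + 1], pch = 19)
abline(b[1],        b[2], col = "orange")   # g = 0 line
abline(b[1] + b[3], b[2], col = "blue")     # g = 1: same slope, shifted intercept
```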

Question 3: Open the Loblolly data set in R.

Which regression model shows the best linearity according to the residuals vs fitted values plot of that regression?

# Running this code:

plot(lm(height ~ age + I(age^2), data=Loblolly), which=1)

plot(lm(height ~ age + I(age^2) + I(age^3), data=Loblolly), which=1)

plot(lm(height ~ age + I(age^2) + I(age^3) + I(age^4), data=Loblolly), which=1)

plot(lm(height ~ age + I(age^2) + I(age^3) + I(age^4) + I(age^5), data=Loblolly), which=1)

# will show the last one has the "flattest" red line, indicating the best linear fit of the data. If you want to resolve the lack of constant variance, you could then apply:

boxCox(lm(height ~ age + I(age^2) + I(age^3) + I(age^4) + I(age^5), data=Loblolly))

# and the resulting transformation looks quite nice:

plot(lm(sqrt(sqrt(height)) ~ age + I(age^2) + I(age^3) + I(age^4) + I(age^5), data=Loblolly), which=1)

# except linearity is still lacking...

plot(lm(sqrt(sqrt(height)) ~ age + I(age^2) + I(age^3) + I(age^4) + I(age^5), data=Loblolly), which=2)



Week 7: Lowess Curves & Model Selection

Using R-squared and Adjusted R-squared

model_simple <- lm(y ~ x, data = data)
summary(model_simple)$r.squared
## [1] 0.3821555
summary(model_simple)$adj.r.squared
## [1] 0.3758509
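
Adjusted \(R^2\) applies a penalty for each estimated parameter: \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), where \(p\) is the number of predictors. A quick sketch verifying this against summary(), using the mtcars data that appears elsewhere in these notes:

```r
# Verify the adjusted R-squared formula by hand on an mtcars regression
fit <- lm(mpg ~ qsec, data = mtcars)

r2 <- summary(fit)$r.squared
n  <- nrow(mtcars)   # 32 observations
p  <- 1              # one predictor (qsec)

adj_by_hand <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

isTRUE(all.equal(adj_by_hand, summary(fit)$adj.r.squared))  # TRUE
```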

Lowess Curve

ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.5, se = FALSE, color = "blue")
## `geom_smooth()` using formula = 'y ~ x'
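
In base graphics the analogous tool is lowess(), whose f argument controls the span. A sketch showing how the span changes the smoothness, using the Loblolly data:

```r
# Smaller f = smaller span = a wigglier lowess curve
plot(height ~ age, data = Loblolly, pch = 19, col = rgb(0.2, 0.2, 0.2, 0.4))
lines(lowess(Loblolly$age, Loblolly$height, f = 2/3), col = "firebrick", lwd = 2)  # default span
lines(lowess(Loblolly$age, Loblolly$height, f = 1/4), col = "skyblue",  lwd = 2)   # tighter span
```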



Skill Quiz - Lowess Curves and Model Selection

Problem 1

MODEL 1

Fit this model to the KidsFeet dataset. Use your model to answer the questions and re-create the graph shown below.

\[\overbrace{Y_i}^{length} = \beta_0 + \beta_1 \overbrace{X_{i1}}^{width} + \beta_2 X_{i1}^2 + \epsilon_i\]

  • What type of regression model is stated above for \(Y_i\)? : Quadratic Model
feet.lm <- lm(length ~ width + I(width^2), data = KidsFeet)

summary(feet.lm)
## 
## Call:
## lm(formula = length ~ width + I(width^2), data = KidsFeet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6131 -1.0162  0.2720  0.6543  2.0057 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  22.9167    43.9401   0.522    0.605
## width        -1.2848     9.8528  -0.130    0.897
## I(width^2)    0.1647     0.5512   0.299    0.767
## 
## Residual standard error: 1.038 on 36 degrees of freedom
## Multiple R-squared:  0.4125, Adjusted R-squared:  0.3798 
## F-statistic: 12.64 on 2 and 36 DF,  p-value: 6.961e-05
b <- coef(feet.lm)

plot(length ~ width, data = KidsFeet, pch=19) 
curve(b[1] + b[2]*x + b[3]*x^2, col = "black", lwd = 2, add=TRUE)


MODEL 2

Fit this model to the KidsFeet dataset. Use your model to answer the questions and re-create the graph shown below.

\[\overbrace{Y_i}^{length} = \beta_0 + \beta_1 \overbrace{X_{i1}}^{width} + \beta_2 \overbrace{X_{i2}}^{sex} + \beta_3 X_{i1}X_{i2} + \epsilon_i\]

  • What type of regression is stated above for \(Y_i\)? : Two-lines Model
feet2.lm <- lm(length ~ width + sex + sex:width, data = KidsFeet)

summary(feet2.lm)
## 
## Call:
## lm(formula = length ~ width + sex + sex:width, data = KidsFeet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6035 -1.0350  0.2049  0.7072  2.0209 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  10.9314     4.9118   2.226  0.03258 * 
## width         1.5423     0.5339   2.889  0.00659 **
## sexG         -1.1856     6.6054  -0.179  0.85858   
## width:sexG    0.1170     0.7328   0.160  0.87410   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.051 on 35 degrees of freedom
## Multiple R-squared:  0.4136, Adjusted R-squared:  0.3634 
## F-statistic: 8.229 on 3 and 35 DF,  p-value: 0.0002821
b <- coef(feet2.lm)

plot(length ~ width, data = KidsFeet, col= c("skyblue", "firebrick")[as.factor(sex)], pch = 19)

curve(b[1] + b[2]*x, col = "skyblue", lwd = 2, add = TRUE)

curve((b[1] + b[3]) + (b[2] + b[4])*x, col = "firebrick", lwd = 2, add = TRUE)


MODEL 3

Fit this model to the KidsFeet dataset. Use your model to answer the questions and re-create the graph shown below.

\[\overbrace{Y_i}^{length} = \beta_0 + \beta_1 \overbrace{X_{i1}}^{width} + \beta_2 X_{i1}^2 + \beta_3 \overbrace{X_{i2}}^{sex} + \beta_4 X_{i1}^2X_{i2} + \epsilon_i\]

  • What type of regression model is stated above for \(Y_i\)? : Double Quadratic Model
feet3 <- lm(length ~ width + I(width^2) + sex + I(width^2):sex, data = KidsFeet)

summary(feet3)
## 
## Call:
## lm(formula = length ~ width + I(width^2) + sex + I(width^2):sex, 
##     data = KidsFeet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6285 -0.9834  0.1947  0.6172  2.0940 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     48.36370   66.71647   0.725    0.473
## width           -6.63499   14.54635  -0.456    0.651
## I(width^2)       0.44556    0.79217   0.562    0.577
## sexG            -2.59549    4.84822  -0.535    0.596
## I(width^2):sexG  0.03040    0.05963   0.510    0.614
## 
## Residual standard error: 1.062 on 34 degrees of freedom
## Multiple R-squared:  0.419,  Adjusted R-squared:  0.3507 
## F-statistic: 6.131 on 4 and 34 DF,  p-value: 0.0007948
plot(length ~ width, data = KidsFeet, col = c("skyblue", "firebrick")[as.factor(sex)], pch = 19)

b <- coef(feet3)

curve(b[1] + b[2]*x + b[3]*x^2, lwd = 2, add= TRUE, col = "skyblue")

curve((b[1]+b[4]) + (b[2])*x + (b[3] + b[5])*x^2, lwd = 2, add = TRUE, col = "firebrick")


MODEL 4

Fit this model to the KidsFeet dataset. Use your model to answer the questions and re-create the graph shown below.

\[\overbrace{Y_i}^{length} = \beta_0 + \beta_1 \overbrace{X_{i1}}^{width} + \epsilon_i\]

  • What type of regression model is stated above for \(Y_i\)? : Simple Linear Model
feet4 <- lm(length ~ width, data = KidsFeet)

summary(feet4)
## 
## Call:
## lm(formula = length ~ width, data = KidsFeet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6331 -1.0372  0.2299  0.7128  1.9642 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.8172     2.9381   3.341  0.00192 ** 
## width         1.6576     0.3262   5.081  1.1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 37 degrees of freedom
## Multiple R-squared:  0.411,  Adjusted R-squared:  0.3951 
## F-statistic: 25.82 on 1 and 37 DF,  p-value: 1.097e-05
plot(length ~ width, data = KidsFeet, pch = 19)

b <- coef(feet4)

curve(b[1] + b[2]*x, lwd = 2, add= TRUE)


Conclusion: Which of the above models is the “best” model of the four based on the adjusted R-squared value?

  • Model 1 : Quadratic Model (Adjusted \(R^2\) = 0.3798)
  • Model 2 : Two-lines Model (Adjusted \(R^2\) = 0.3634)
  • Model 3 : Double Quadratic Model (Adjusted \(R^2\) = 0.3507)
  • Model 4 : Simple Linear Model (Adjusted \(R^2\) = 0.3951)

The BIGGEST adjusted \(R^2\) marks the “best” model, so Model 4, the simple linear model, wins here.


Problem 2

Part (a)

Create a scatterplot of the height of a tree on the age of the tree. Overlay both a simple linear regression line and a lowess curve, degree 1 on the plot. Does the lowess curve suggest that the line is a good fit or a poor fit?

tree.lm <- lm(height ~ age, data = Loblolly)

plot(height ~ age, data = Loblolly)

b <- coef(tree.lm)

curve(b[1] + b[2]*x, lwd = 2, add = TRUE)

# Have the x value first and the y value second!!
lines(lowess(Loblolly$age, Loblolly$height), col="firebrick")

  • The lowess curve shows that the line is not a very good fit. The lowess curve passes roughly through the mean of each cluster of points, while the line substantially misses those means, often passing through the max or min of a cluster instead.


Part (b)

Re-draw the scatterplot, this time overlaying the curve from a quadratic regression model as well as the lowess curve of degree 1. Does this model look better or worse than the simple linear regression line when compared to the lowess curve?
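
One way to complete this part, reusing the pattern from Part (a) (a sketch):

```r
# A sketch for Part (b): quadratic fit vs. lowess curve on the Loblolly data
tree2.lm <- lm(height ~ age + I(age^2), data = Loblolly)
b <- coef(tree2.lm)

plot(height ~ age, data = Loblolly)
curve(b[1] + b[2]*x + b[3]*x^2, lwd = 2, add = TRUE)
lines(lowess(Loblolly$age, Loblolly$height), col = "firebrick")
```

The quadratic curve should track the lowess curve more closely than the straight line did, suggesting a better fit.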


Problem 3

Consider the ?Utilities data set. The goal of this question is to find a “best model” for predicting the monthly electricity bill, elecbill. Work through each of the parts below to do this.

Part (a)

Select (by clicking on it) the row of the pairs plot below that corresponds to elecbill being the y-variable in each of the plots. This is the main row you want to look at as you decide which variable is most useful in explaining y, the elecbill.

Which variable is “visibly” the “best” according to the pairs plot for predicting elecbill? Click on the correct plot and name the column.

  • kwh


Part (b)

Run a regression for the following model where X1i is the variable you identified in Part (a).

\[\underbrace{Y_i}_\text{elecbill} = \beta_0 + \beta_1\underbrace{X_{1i}}_\text{kwh} + \epsilon_i\]

State the p-value for the β1 term. State the R2 value of the regression. Create the residuals vs fitted values plot for the regression. What do each of these show?

shocked <- lm(elecbill ~ kwh, data = Utilities)

summary(shocked) %>% pander()
                Estimate   Std. Error   t value    Pr(>|t|)
  (Intercept)     -5.001        4.522    -1.106      0.2711
  kwh             0.1088     0.005816      18.7   1.155e-36

Fitting linear model: elecbill ~ kwh

  Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
           117                 12.66    0.7525             0.7503
plot(shocked, which=1)

This plot shows that there is a possible problem in the linearity of the regression. A transformation may be in order. But first, let’s search for other useful variables.


Part (c)

Create a new pairs plot that includes the residuals and fitted values from the previous regression.

Hint: `cbind(R = lm1$res, Fit = lm1$fit, Utilities)`

util <- cbind(R = shocked$residuals, Fit = shocked$fitted.values, Utilities)

pairs(util, pch = 19, cex = 0.5)

There are a lot of interesting patterns in the pairs plot at this point. Notice especially the plot for “month” in the first row of the pairs plot. Also, notice how month is strongly related to several variables in a quadratic way. Click on the scatterplots where month, acting as the x-variable, is a strong predictor of those variables.

Thus, while many variables could give some insight on the residuals, month seems to explain most of those variables as well, which suggests using month as a variable to explain elecbill. Month would also be the best choice among the related variables because it is the easiest to “measure” in real life. However, there is one other variable, unrelated to month, that also sheds some light on the residuals from the regression in Part (b).

  • This variable is year: it shows a weak but positive linear relationship with the residuals and is not related to month at all according to the pairs plot.


Part (d)

Perform a second regression that adds both of these newly identified variables to the regression performed in Part (b). This should now be a regression of elecbill on three different explanatory variables, namely,

\[ \underbrace{Y_i}_\text{elecbill} = \beta_0 + \beta_1\underbrace{X_{1i}}_\text{kwh} + \beta_2 \underbrace{X_{2i}}_\text{month} + \beta_3 \underbrace{X_{3i}}_\text{year} + \epsilon_i \]

Report the p-values of each of the two new variables, as well as the adjusted \(R^2\) value of the regression, and create a residuals vs fitted values plot for the regression.

shocking <- lm(elecbill ~ kwh + month + year, data = Utilities)

summary(shocking)
## 
## Call:
## lm(formula = elecbill ~ kwh + month + year, data = Utilities)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.6383  -6.1213  -0.4424   6.5672  22.3545 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.496e+03  6.209e+02  -8.852 1.37e-14 ***
## kwh          1.004e-01  4.973e-03  20.179  < 2e-16 ***
## month        3.716e-01  2.933e-01   1.267    0.208    
## year         2.741e+00  3.099e-01   8.845 1.43e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.811 on 113 degrees of freedom
## Multiple R-squared:  0.854,  Adjusted R-squared:  0.8501 
## F-statistic: 220.3 on 3 and 113 DF,  p-value: < 2.2e-16

It is not surprising that the month term is not significant because it has a quadratic, not linear, relationship with the residuals. We will keep the month term in the model until we add the quadratic month term as well, then decide if we should remove the month term or not.

The adjusted \(R^2\) of the regression is 0.8501 (the multiple \(R^2\) is 0.854). This shows an even tighter fit than the original regression in Part (b), so our model is getting better.

plot(shocking, which=1)

This plot continues to suggest a possible y transformation to correct the constant variance problem. However, before we do that, let’s try adding in the quadratic month term.


Part (e)

Add to the regression model of Part (d) a quadratic month term. The model is now given by

\[ \underbrace{Y_i}_\text{elecbill} = \beta_0 + \beta_1\underbrace{X_{1i}}_\text{kwh} + \beta_2 \underbrace{X_{2i}}_\text{month} + \beta_3 \underbrace{X_{3i}}_\text{year} + \beta_4 \underbrace{X_{2i}^2}_\text{month^2} + \epsilon_i \]

Note how the p-values of all terms have now changed. State the adjusted \(R^2\) value of the regression. Create a residuals vs fitted values plot of the regression.

What does all of this show?

shockz <- lm(elecbill ~ kwh + month + year + I(month^2), data = Utilities)

summary(shockz)
## 
## Call:
## lm(formula = elecbill ~ kwh + month + year + I(month^2), data = Utilities)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.347  -3.995  -0.183   4.443  23.561 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.327e+03  4.874e+02 -10.931  < 2e-16 ***
## kwh          1.106e-01  4.085e-03  27.083  < 2e-16 ***
## month        8.118e+00  9.432e-01   8.607 5.31e-14 ***
## year         2.644e+00  2.433e-01  10.867  < 2e-16 ***
## I(month^2)  -6.072e-01  7.171e-02  -8.468 1.10e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.694 on 112 degrees of freedom
## Multiple R-squared:  0.911,  Adjusted R-squared:  0.9078 
## F-statistic: 286.5 on 4 and 112 DF,  p-value: < 2.2e-16

The adjusted \(R^2\) has jumped to 0.9078. Ironically, though, the residual plot now shows a problem with both linearity and constant variance.

plot(shockz, which=1)


Part(f)

Let’s check one more time if any other variables should be added to the model before we try a transformation.

Create another pairs plot that uses the residuals and fitted values from the regression in Part (d). What do you see?

Hint: Look carefully at the residuals vs month, residuals vs year, and residuals vs temp plots. While residuals vs ccf and residuals vs ThermsPerDay also look interesting, these variables would be harder to measure than month, year, and temp. So let’s consider variables that are easiest to measure first.
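No code is shown for this pairs plot, so here is one hedged way to build it, assuming the Part (d) model is stored as shocking (as above) and keeping only the numeric columns of Utilities so pairs() behaves:

```r
# sketch: pairs plot of the residuals and fitted values from Part (d)
# alongside the numeric columns of Utilities
num.cols <- Utilities[, sapply(Utilities, is.numeric)]
pairs(cbind(Residuals = shocking$residuals,
            Fitted    = shocking$fitted.values,
            num.cols),
      panel = panel.smooth)
```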


Part (g)

Perform a regression that includes temp and I(temp^2) into the regression from Part (e). So the model now becomes

\[ \underbrace{Y_i}_\text{elecbill} = \beta_0 + \beta_1\underbrace{X_{1i}}_\text{kwh} + \beta_2 \underbrace{X_{2i}}_\text{month} + \beta_3 \underbrace{X_{3i}}_\text{year} + \beta_4 \underbrace{X_{2i}^2}_\text{month^2} + \beta_5 \underbrace{X_{4i}}_\text{temp} + \beta_6 \underbrace{X_{4i}^2}_\text{temp^2} + \epsilon_i \]

plzimded <- lm(elecbill ~ kwh + month + year + I(month^2) + temp + I(temp^2), data = Utilities)

summary(plzimded)
## 
## Call:
## lm(formula = elecbill ~ kwh + month + year + I(month^2) + temp + 
##     I(temp^2), data = Utilities)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.924  -2.982   0.109   3.552  23.950 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.399e+03  4.448e+02 -12.138  < 2e-16 ***
## kwh          1.051e-01  3.896e-03  26.985  < 2e-16 ***
## month        3.524e+00  2.676e+00   1.317 0.190623    
## year         2.690e+00  2.222e-01  12.108  < 2e-16 ***
## I(month^2)  -2.632e-01  1.929e-01  -1.364 0.175204    
## temp        -5.457e-01  2.369e-01  -2.304 0.023128 *  
## I(temp^2)    7.964e-03  2.055e-03   3.876 0.000181 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.019 on 110 degrees of freedom
## Multiple R-squared:  0.9272, Adjusted R-squared:  0.9233 
## F-statistic: 233.6 on 6 and 110 DF,  p-value: < 2.2e-16


Part(h)

Drop month and I(month^2) from the model in Part (g) but leave all other terms in the model. This reduces the model to

\[ \underbrace{Y_i}_\text{elecbill} = \beta_0 + \beta_1\underbrace{X_{1i}}_\text{kwh} + \beta_3 \underbrace{X_{3i}}_\text{year} + \beta_5 \underbrace{X_{4i}}_\text{temp} + \beta_6 \underbrace{X_{4i}^2}_\text{temp^2} + \epsilon_i \]

bbqchik <- lm(elecbill ~ kwh + year+ temp + I(temp^2), data = Utilities)

summary(bbqchik)
## 
## Call:
## lm(formula = elecbill ~ kwh + year + temp + I(temp^2), data = Utilities)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.706  -2.700   0.208   3.705  23.820 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.462e+03  4.354e+02 -12.544  < 2e-16 ***
## kwh          1.029e-01  3.328e-03  30.917  < 2e-16 ***
## year         2.723e+00  2.173e-01  12.530  < 2e-16 ***
## temp        -3.780e-01  1.814e-01  -2.084 0.039481 *  
## I(temp^2)    7.466e-03  1.930e-03   3.868 0.000184 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.023 on 112 degrees of freedom
## Multiple R-squared:  0.9258, Adjusted R-squared:  0.9232 
## F-statistic: 349.5 on 4 and 112 DF,  p-value: < 2.2e-16
plot(bbqchik, which = 1)

It looks like temp gives more insight than month does. Notice in Part (g) that once temp and I(temp^2) entered the model, the p-values for month (0.19) and I(month^2) (0.18) were no longer significant. So even though we chose month earlier in our model search, temp turns out to be the better variable in the end; we dropped month and switched to temp.


Part(i)

Though we could check another pairs plot at this point (it actually doesn’t show anything useful), let’s try removing the outlier identified in Part (h) and a y-transformation instead. Run boxCox(…) on your lm from Part (h) after redoing the regression with the outlier removed. Which value of λ is suggested?

bbqchik95 <- lm(elecbill ~ kwh + year + temp + I(temp^2), data = Utilities[-95,])

boxCox(bbqchik95)

Any value of λ between roughly 0.5 and 1 is possible. Since λ = 1 is in that range, leaving our regression as is would be acceptable. And, since we are tired at this point, that sounds like a great idea.
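If you would rather read the peak λ off numerically than eyeball the plot, MASS::boxcox() invisibly returns the λ grid and the corresponding log-likelihoods (a sketch; the plotted car::boxCox gives the same picture):

```r
# sketch: find the lambda that maximizes the profile log-likelihood
bc <- MASS::boxcox(bbqchik, plotit = FALSE)
bc$x[which.max(bc$y)]
```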


Part (j)

State the equation for the final regression model for predicting elecbill. Be sure observation 95 is removed, i.e., data=Utilities[-95,]…

burntbbq <- lm(elecbill ~ kwh + year+ temp + I(temp^2), data = Utilities[-95,])

summary(burntbbq)
## 
## Call:
## lm(formula = elecbill ~ kwh + year + temp + I(temp^2), data = Utilities[-95, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5193  -3.7173  -0.1233   3.1444  24.6028 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.773e+03  3.665e+02 -15.752  < 2e-16 ***
## kwh          9.969e-02  2.818e-03  35.380  < 2e-16 ***
## year         2.880e+00  1.829e-01  15.744  < 2e-16 ***
## temp        -5.180e-01  1.529e-01  -3.389 0.000973 ***
## I(temp^2)    9.226e-03  1.632e-03   5.654 1.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.868 on 111 degrees of freedom
## Multiple R-squared:  0.9471, Adjusted R-squared:  0.9452 
## F-statistic: 496.6 on 4 and 111 DF,  p-value: < 2.2e-16
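Reading the coefficients off the summary above, the final fitted model (with observation 95 removed) is

\[ \hat{Y}_{\text{elecbill}} = -5773 + 0.0997\,\text{kwh} + 2.880\,\text{year} - 0.518\,\text{temp} + 0.00923\,\text{temp}^2 \]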
imdone <- lm(elecbill ~ year + I(month^2), data = Utilities)

summary(imdone)
## 
## Call:
## lm(formula = elecbill ~ year + I(month^2), data = Utilities)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.934 -14.270   0.354  16.742  44.315 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.029e+03  1.304e+03  -6.156 1.15e-08 ***
## year         4.037e+00  6.504e-01   6.208 8.96e-09 ***
## I(month^2)   1.961e-01  4.276e-02   4.587 1.16e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.1 on 114 degrees of freedom
## Multiple R-squared:  0.3187, Adjusted R-squared:  0.3068 
## F-statistic: 26.67 on 2 and 114 DF,  p-value: 3.157e-10



Class Activity - Expanding the Regression Model

Part 1 - Weighted Regressions

X <- c(1, 4,  7,   8,  10, 20)

Y <- c(3, 5, 18, 13, 12,   1)

 

w <- c(1, .5, .2, 1, 1, .1)

mylm <- lm(Y ~ X, weights=w)

plot(Y ~ X, pch=21, bg=rgb(1-w,1-w,1-w), col="orange")

abline(mylm)


Then, reproduce each of the graphics below by changing only the weights, w, and re-running the above code for mylm, plot, and abline.

X <- c(1, 4,  7,   8,  10, 20)

Y <- c(3, 5, 18, 13, 12,   1)

 
# weights for the 1st through 6th dots (left to right); a larger weight gives a dot more pull on the slope of the line (keep weights in [0, 1] so the rgb() shading below works)
w <- c(1, 1, 0, 0, 0, 0)



mylm <- lm(Y ~ X, weights=w)

plot(Y ~ X, pch=21, bg=rgb(1-w,1-w,1-w), col="orange")

abline(mylm)


X <- c(1, 4,  7,   8,  10, 20)

Y <- c(3, 5, 18, 13, 12,   1)

 
# weights for the 1st through 6th dots (left to right); a larger weight gives a dot more pull on the slope of the line (keep weights in [0, 1] so the rgb() shading below works)
w <- c(0, 0, 1, 0, 0, 1)



mylm <- lm(Y ~ X, weights=w)

plot(Y ~ X, pch=21, bg=rgb(1-w,1-w,1-w), col="orange")

abline(mylm)


X <- c(1, 4,  7,   8,  10, 20)

Y <- c(3, 5, 18, 13, 12,   1)

 
# weights for the 1st through 6th dots (left to right); a larger weight gives a dot more pull on the slope of the line (keep weights in [0, 1] so the rgb() shading below works)
w <- c(1, 1, 0.4, 1, 1, 0.2)



mylm <- lm(Y ~ X, weights=w)

plot(Y ~ X, pch=21, bg=rgb(1-w,1-w,1-w), col="orange")

abline(mylm)


X <- c(1, 4,  7,   8,  10, 20)

Y <- c(3, 5, 18, 13, 12,   1)

 
# weights for the 1st through 6th dots (left to right); a larger weight gives a dot more pull on the slope of the line (keep weights in [0, 1] so the rgb() shading below works)
w <- c(0.3, 0.4, 1, 1, 1, 1)



mylm <- lm(Y ~ X, weights=w)

plot(Y ~ X, pch=21, bg=rgb(1-w,1-w,1-w), col="orange")

abline(mylm)
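To see what the weights do mathematically, lm(Y ~ X, weights=w) minimizes \(\sum_i w_i (Y_i - \hat{Y}_i)^2\), which has the closed form \(\hat{\beta} = (X^\top W X)^{-1} X^\top W Y\). A quick sketch checking that formula against lm():

```r
X <- c(1, 4, 7, 8, 10, 20)
Y <- c(3, 5, 18, 13, 12, 1)
w <- c(1, .5, .2, 1, 1, .1)

Xmat <- cbind(1, X)   # design matrix with an intercept column
W <- diag(w)          # diagonal weight matrix
beta <- solve(t(Xmat) %*% W %*% Xmat, t(Xmat) %*% W %*% Y)
beta                  # matches coef(lm(Y ~ X, weights = w))
```

A weight of 0 removes a dot from the fit entirely (it contributes nothing to the sum), which is why the all-but-two-zero weight vectors above force the line straight through the two remaining dots.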


Part 2 - Loess Curves: How they Work

Change the code so that quadratic regression curves are used to fit the loess curve instead of lines.

# Get some data for X and Y
X <- cars$speed
Y <- cars$dist

# Ensure the data has no missing values in any of the X,Y pairs
keep <- which(!is.na(X) & !is.na(Y))
X <- X[keep]
Y <- Y[keep]

# Decide on the fraction of points to use for the local regressions
f <- 1/2

# Identify the total sample size and sample size corresponding to f
n <- length(X) #total sample size
nn <- floor(n*f) #number of dots corresponding to percentage f

# Create storage for the lowess fitted-values
lfit <- rep(NA,n)

# Begin the for loop to compute lowess fitted-values
for (xc in 1:n){ #let each dot be the center dot, one at a time
  xdists <- X - X[xc]  #compute distance from all X-values to center dot x-value
  r <- sort(abs(xdists))[nn] #locate the nn"th" smallest absolute x-distance from center x-value
  xdists.nbrhd <- which(abs(xdists) < r) #identify x-values within the neighborhood
  w <- rep(0, length(xdists)) #initialize the weights for all x-values at 0
  w[xdists.nbrhd] <- (1 - abs(xdists[xdists.nbrhd]/r)^3)^3 #tri-cubic weight function 
  plot(Y ~ X, pch=21, bg=rgb(.53,.81,.92, w), #color dots by their weights
       col=rgb(.2,.2,.2,.3), cex=1.5, yaxt='n', xaxt='n', xlab="", ylab="")
  points(Y[xc] ~ X[xc], pch=16, col="orange") #add center dot to plot
  lmc <- lm(Y ~ X, weights=w) #run weighted regression
  curve(lmc$coef[1] + lmc$coef[2]*x, #draw line on neighborhood using from and to
        from=min(X[xdists.nbrhd]), to=max(X[xdists.nbrhd]), col="orange", add=TRUE)
  lines(lfit[1:xc] ~ X[1:xc], col="gray") #add lowess line up to current center dot
  
  #lines(lowess(X,Y), col=rgb(0.698,0.133,0.133,.2))
  cat("\n\n")
  readline(prompt=paste0("Center point is point #", xc, "... Press [enter] to continue..."))
  
  
  MADnotThereYet <- TRUE
  count <- 0
  while(MADnotThereYet){ #while loop will continue until "MadnotThereYet" becomes false
    
    readline(prompt=paste0("\n   Adjusting line to account for outliers in the y-direction... Press [enter] to continue..."))   
    
    # overwrite current regression to lighter color if new regression still needed:
    curve(lmc$coef[1] + lmc$coef[2]*x, from=min(X[xdists.nbrhd]), to=max(X[xdists.nbrhd]), col="wheat", add=TRUE)
    
    # update the weights
    MAD <- median(abs(lmc$res))
    resm <- abs(lmc$res)/(6*MAD) #abs() so large negative residuals are also downweighted
    resm[resm>1] <- 1
    bisq <- (1-resm^2)^2
    w <- w*bisq
    obs <- coef(lmc)
    lmc <- lm(Y ~ X, weights=w)
    
    curve(lmc$coef[1] + lmc$coef[2]*x, from=min(X[xdists.nbrhd]), to=max(X[xdists.nbrhd]), col="orange", add=TRUE)
    
    #stopping criterion for the weighted regressions in y-direction
    count <- count + 1
    if ( (sum(abs(obs-lmc$coef))<.1) | (count > 3))
      MADnotThereYet <- FALSE
    
  }
  
  curve(lmc$coef[1] + lmc$coef[2]*x, from=min(X[xdists.nbrhd]), to=max(X[xdists.nbrhd]), col="green", add=TRUE)
  points(lmc$coef[1] + lmc$coef[2]*X[xc] ~ X[xc], pch=16, col="green")
  
  
  readline(prompt=paste0("\n   Use final line to get fitted value for this point... Press [enter] to continue to next point..."))
  
  lfit[xc] <- predict(lmc, data.frame(X=X[xc]))
  lines(lfit[1:xc] ~ X[1:xc], col="gray")
  
  
  if (xc == n){
    readline(prompt=paste0("\n  Press [enter] to see actual Lowess curve..."))
    lines(lowess(X,Y, f=f), col="firebrick")
    legend("topleft", bty="n", legend="Actual lowess Curve using lowess(...)", col="firebrick", lty=1)
  }
  
  
}
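The question asks for quadratic local fits instead of lines, and only the pieces of the loop that fit and draw the local model need to change. A minimal sketch of the edited lines (each curve(...) call in the loop gets the same three-coefficient form):

```r
lmc <- lm(Y ~ X + I(X^2), weights=w)  # local quadratic instead of local line
curve(lmc$coef[1] + lmc$coef[2]*x + lmc$coef[3]*x^2,
      from=min(X[xdists.nbrhd]), to=max(X[xdists.nbrhd]),
      col="orange", add=TRUE)
```

This is the same idea loess() uses internally when degree=2; with f = 1/2 each neighborhood contains plenty of points to estimate the three coefficients.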

## 
## 
## Center point is point #1... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## (the same "Adjusting line..." and "Use final line..." prompts repeat for each remaining center point, #2 onward)
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #37... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #38... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #39... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #40... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #41... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #42... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #43... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #44... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #45... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #46... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #47... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #48... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #49... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...

## 
## 
## Center point is point #50... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Adjusting line to account for outliers in the y-direction... Press [enter] to continue...
## 
##    Use final line to get fitted value for this point... Press [enter] to continue to next point...
## 
##   Press [enter] to see actual Lowess curve...



Weeks 8 & 9: Model Selection & Validation

Simulating a Complex Model

set.seed(123)
x <- runif(100, 0, 10)
y_true <- 1 + 2 * x - 0.5 * x^2          # noiseless quadratic signal
y_observed <- y_true + rnorm(100, 0, 2)  # add N(0, 2) noise to get the observed values
df <- data.frame(x = x, y = y_observed)

model_quad <- lm(y ~ poly(x, 2), data = df)
summary(model_quad)
## 
## Call:
## lm(formula = y ~ poly(x, 2), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3911 -1.2490 -0.0851  1.1498  4.4525 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5.5849     0.1944  -28.73   <2e-16 ***
## poly(x, 2)1 -86.5435     1.9437  -44.52   <2e-16 ***
## poly(x, 2)2 -34.2285     1.9437  -17.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.944 on 97 degrees of freedom
## Multiple R-squared:  0.9594, Adjusted R-squared:  0.9586 
## F-statistic:  1146 on 2 and 97 DF,  p-value: < 2.2e-16
plot(x, y_observed, pch = 19, main = "True vs Estimated Model")
curve(1 + 2 * x - 0.5 * x^2, add = TRUE, col = "green", lwd = 2)
lines(sort(x), predict(model_quad)[order(x)], col = "red", lwd = 2)
legend("topright", legend = c("True Model", "Estimated Model"), col = c("green", "red"), lwd = 2)

Train/Test Split

set.seed(321)
train_indices <- sample(1:nrow(df), 0.7 * nrow(df))
train <- df[train_indices, ]
test <- df[-train_indices, ]

model_train <- lm(y ~ poly(x, 2), data = train)
preds <- predict(model_train, newdata = test)
mse <- mean((test$y - preds)^2)
mse
## [1] 3.522266
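The single split above generalizes naturally to comparing candidate models. A minimal sketch (same seeds and simulation as above; only the degree loop is new) refits the training data at several polynomial degrees and compares test-set MSE:

```r
# Reuse the simulation and 70/30 split, then compare test MSE across
# polynomial degrees. Degree 2 is the true form of the signal, so it
# should fit far better than degree 1 on the held-out data.
set.seed(123)
x <- runif(100, 0, 10)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(100, 0, 2)
df <- data.frame(x = x, y = y)

set.seed(321)
train_indices <- sample(1:nrow(df), 0.7 * nrow(df))
train <- df[train_indices, ]
test  <- df[-train_indices, ]

degrees  <- 1:5
test_mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})
data.frame(degree = degrees, test_mse = round(test_mse, 2))
```

Picking the degree with the lowest test MSE, rather than the lowest training MSE, is what protects against over-fitting.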

Week 10 & 11: Real Data Regression Models

Use real data (e.g., mtcars)

model_real <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
summary(model_real)
## 
## Call:
## lm(formula = mpg ~ wt + hp + factor(cyl), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2612 -1.0320 -0.3210  0.9281  5.3947 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.84600    2.04102  17.563 2.67e-16 ***
## wt           -3.18140    0.71960  -4.421 0.000144 ***
## hp           -0.02312    0.01195  -1.934 0.063613 .  
## factor(cyl)6 -3.35902    1.40167  -2.396 0.023747 *  
## factor(cyl)8 -3.18588    2.17048  -1.468 0.153705    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.44 on 27 degrees of freedom
## Multiple R-squared:  0.8572, Adjusted R-squared:  0.8361 
## F-statistic: 40.53 on 4 and 27 DF,  p-value: 4.869e-11
pairs(mtcars[, c("mpg", "wt", "hp", "cyl")])


Week 12: Logistic Regression

Fit a Logistic Model

set.seed(100)
x <- rnorm(100)
z <- 1.5 * x - 0.75
p <- 1 / (1 + exp(-z))
y <- rbinom(100, 1, p)

log_data <- data.frame(x = x, y = factor(y))
model_logit <- glm(y ~ x, data = log_data, family = binomial)
summary(model_logit)
## 
## Call:
## glm(formula = y ~ x, family = binomial, data = log_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.8835     0.2788  -3.169  0.00153 ** 
## x             1.9267     0.4123   4.673 2.96e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 130.684  on 99  degrees of freedom
## Residual deviance:  88.394  on 98  degrees of freedom
## AIC: 92.394
## 
## Number of Fisher Scoring iterations: 5

Predict Probabilities and Classify

prob <- predict(model_logit, type = "response")
pred_class <- ifelse(prob > 0.5, 1, 0)
table(Predicted = pred_class, Actual = y)
##          Actual
## Predicted  0  1
##         0 55 17
##         1  9 19
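The confusion table implies an overall accuracy of (55 + 19)/100 = 0.74. A self-contained check (same seed as the simulation above):

```r
# Re-simulate with the same seed, refit, and compute the fraction of
# correct classifications at the 0.5 threshold.
set.seed(100)
x <- rnorm(100)
p <- 1 / (1 + exp(-(1.5 * x - 0.75)))
y <- rbinom(100, 1, p)
model_logit <- glm(y ~ x, family = binomial)
pred_class <- ifelse(predict(model_logit, type = "response") > 0.5, 1, 0)
mean(pred_class == y)   # (55 + 19) / 100 = 0.74
```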



Skill Quiz - Logistic Regression

Problem 1

  • Use the KidsFeet data set from library(mosaic) to practice fitting a logistic regression.

Part (a)

Is birthmonth enough information to predict the birthyear of a fourth grade student? Fit a logistic regression of birthyear == 88 on the birthmonth of a child to find out.

Enter the values of your estimates for \(\beta_0\) and \(\beta_1\) in the logistic model:

\[ P(\text{birthyear}_i = 88 \, | \, \text{birthmonth}_i) = \frac{e^{\beta_0 + \beta_1 \text{birthmonth}_i}}{1 + e^{\beta_0 + \beta_1 \text{birthmonth}_i}} \]

babies <- KidsFeet %>%
  mutate(year88 = case_when(
    birthyear %in% 88 ~ 1,
    birthyear %in% 87 ~ 0))

babe <- glm(year88 ~ birthmonth, data = babies, family=binomial)
summary(babe)
## 
## Call:
## glm(formula = year88 ~ birthmonth, family = binomial, data = babies)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   9.9665     3.8307   2.602  0.00928 **
## birthmonth   -0.9992     0.3977  -2.513  0.01198 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 36.708  on 38  degrees of freedom
## Residual deviance: 16.921  on 37  degrees of freedom
## AIC: 20.921
## 
## Number of Fisher Scoring iterations: 7

Part (b)

Fit a new model using only the width of the child’s foot instead of birthmonth to predict the birthyear of the child. Is foot width a better or worse predictor of the birthyear of the child than birthmonth?

State the AIC, null deviance, and residual deviance of both logistic regression models.

baby <- glm(year88 ~ width, data = babies, family=binomial)
summary(baby)
## 
## Call:
## glm(formula = year88 ~ width, family = binomial, data = babies)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  15.5140     9.0614   1.712   0.0869 .
## width        -1.5363     0.9824  -1.564   0.1179  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 36.708  on 38  degrees of freedom
## Residual deviance: 33.877  on 37  degrees of freedom
## AIC: 37.877
## 
## Number of Fisher Scoring iterations: 5

Part (c)

Use the better of the two models from parts (a) and (b) to predict the probability that a fourth grade boy (from the same era as the students in this data) was born in 88, given that he had a foot width of 9.5 cm, a foot length of 24 cm, and was born in May. (This is more information than you need to make the prediction.)

predict(babe, newdata = data.frame(birthmonth = 5), type = "response")

  • The better model is the birthmonth model from Part (a) (lower AIC: 20.9 vs. 37.9), and May is month 5, so birthmonth = 5. With the fitted coefficients this gives \( \frac{e^{9.9665 - 0.9992(5)}}{1 + e^{9.9665 - 0.9992(5)}} \approx 0.993 \).

Part (d)

Interpret the effect that the slope term from the model used in Part (c) has on the odds of a fourth grade child being born in 88.

1-exp(-0.9992)
## [1] 0.6318261

Answer: “The odds of a child being born in 88 decrease by about 63% for each one-month increase in birthmonth (1 being January and 12 being December). In other words, a child born in February has odds of being born in 88 that are about 63% lower than those of a child born in January, and the same is true for any two consecutive months.”
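The same slope can also be read multiplicatively, which is the usual way to interpret a logistic slope. A quick base-R check of the numbers above (b1 is the slope estimate from Part (a)):

```r
# exp(b1) is the factor the odds are multiplied by for each one-month
# increase in birthmonth.
b1 <- -0.9992
exp(b1)       # about 0.368: each additional month multiplies the odds by 0.368
1 - exp(b1)   # about 0.632: a 63% decrease, matching the answer above
```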

Problem 2

Consider the ?RailTrail data set in library(mosaicData). As shown by the boxplot below, the number of users on the rail trail seems to be influenced by whether or not there was any precipitation on that day.

boxplot(volume ~ precip>0, data=RailTrail, col="skyblue", xlab="90 Days of Records from April to November", ylab="Total Number of Users on the Trail that Day", main="Rain Seems to Reduce the Number of Users", names = c("Days with No Rain", "Rainy Days"))

The goal of this question is to identify a logistic regression model that could be used to predict if there will be rain on a given day or not.

Run the following commands to reduce the data to variables of interest for this problem.

data("RailTrail")

# MASS (loaded in the setup) masks dplyr's select(), so call dplyr::select() explicitly
RT <- RailTrail %>%
  mutate(precipOccurred = (precip > 0)) %>%
  dplyr::select(precipOccurred, lowtemp, spring, summer, fall, cloudcover, weekday)

pairs(RT)

Part (a)

Create a pairs plot that shows how well each variable in the RT data set can explain whether or not precipitation occurred on a given day. State which variables seem to be the strongest predictors of precipitation occurring.

(Note that a major limitation of this data is that all measurements are on the “day of” the precipitation. The data could potentially be more insightful if it contained the temperature or other information for the day prior, or two days prior, or so on… as well.)

#Hint: pairs(..., pch=16, col=rgb(.2,.2,.2,.1))
#Or: pairs(..., pch=16, col=as.factor(yourData$yourYvariable))

The cloudcover variable appears to be the strongest predictor of precipOccurred.

Part (b)

Fit three logistic regression models. The first should use lowtemp to predict precipitation. The second should use cloudcover to predict precipitation. The third should use fall to predict precipitation.

Compare the AIC and residual deviance of each model, which one is “better”?

cool <- glm(precipOccurred ~ lowtemp, data = RT, family=binomial)
summary(cool)
## 
## Call:
## glm(formula = precipOccurred ~ lowtemp, family = binomial, data = RT)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.32489    0.96286  -2.415   0.0158 *
## lowtemp      0.03377    0.01967   1.717   0.0860 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 113.14  on 89  degrees of freedom
## Residual deviance: 110.09  on 88  degrees of freedom
## AIC: 114.09
## 
## Number of Fisher Scoring iterations: 4
cloudy <- glm(precipOccurred ~ cloudcover, data = RT, family=binomial)
summary(cloudy)
## 
## Call:
## glm(formula = precipOccurred ~ cloudcover, family = binomial, 
##     data = RT)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.8900     1.0927  -4.475 7.63e-06 ***
## cloudcover    0.6081     0.1404   4.330 1.49e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 113.136  on 89  degrees of freedom
## Residual deviance:  78.098  on 88  degrees of freedom
## AIC: 82.098
## 
## Number of Fisher Scoring iterations: 5
pumpkinspice <- glm(precipOccurred ~ fall, data = RT, family=binomial)
summary(pumpkinspice)
## 
## Call:
## glm(formula = precipOccurred ~ fall, family = binomial, data = RT)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -0.6931     0.2402  -2.886   0.0039 **
## fall         -0.4055     0.7086  -0.572   0.5672   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 113.14  on 89  degrees of freedom
## Residual deviance: 112.79  on 88  degrees of freedom
## AIC: 116.79
## 
## Number of Fisher Scoring iterations: 4
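The three AICs can also be lined up in one table with a single `AIC()` call. A sketch, assuming `mosaicData` is installed (as in the setup chunk); `transform()` is used here in place of the `mutate()`/`dplyr::select()` pipeline purely for brevity:

```r
# Refit the three one-variable candidates and compare their AICs
# side by side -- lower is better.
library(mosaicData)
data("RailTrail")
RT <- transform(RailTrail, precipOccurred = precip > 0)
cool         <- glm(precipOccurred ~ lowtemp,    data = RT, family = binomial)
cloudy       <- glm(precipOccurred ~ cloudcover, data = RT, family = binomial)
pumpkinspice <- glm(precipOccurred ~ fall,       data = RT, family = binomial)
AIC(cool, cloudy, pumpkinspice)   # cloudy (cloudcover) is lowest, about 82.1
```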

Part (c)

Try every possible 2-variable (no interaction term) logistic regression model that uses both (1) the best variable from Part (b), and (2) each of the other variables in the RT data set, one at a time. Note the AIC of each model. The best two-variable model for this data has an AIC of 82.75. State the model.

What is the p-value of both variables in this model? Which p-value is not significant?

What is the AIC of this model? Is it better or worse than the “best” one-variable model from Part (b)?

whoscontrollingthisthang <- glm( precipOccurred ~ cloudcover + weekday, data = RT, family=binomial)

summary(whoscontrollingthisthang)
## 
## Call:
## glm(formula = precipOccurred ~ cloudcover + weekday, family = binomial, 
##     data = RT)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.5215     1.1073  -4.083 4.44e-05 ***
## cloudcover    0.6322     0.1430   4.420 9.89e-06 ***
## weekdayTRUE  -0.7471     0.6521  -1.146    0.252    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 113.136  on 89  degrees of freedom
## Residual deviance:  76.753  on 87  degrees of freedom
## AIC: 82.753
## 
## Number of Fisher Scoring iterations: 5

However, the p-value for the weekday variable is not significant in this model. It is also interesting to note that the AIC of this model (82.753) is slightly higher than that of the one-variable model that uses only cloudcover (82.098). This is a good illustration of how AIC helps protect against over-fitting.


Week 13: Outliers and Robust Regression

Cook’s Distance

model_lm <- lm(y ~ x)   # a simple linear fit of the simulated (x, y) data
cooksd <- cooks.distance(model_lm)
plot(cooksd, type = "h", main = "Cook's Distance")
abline(h = 4 / length(y), col = "red")   # common 4/n rule-of-thumb cutoff
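To make the 4/n rule concrete, here is a small self-contained sketch: clean linear data with one planted outlier, which the cutoff then flags.

```r
# Plant one gross outlier in otherwise clean linear data, then flag
# observations whose Cook's distance exceeds the 4/n rule of thumb.
set.seed(42)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
y[50] <- y[50] + 15                # the planted outlier
fit <- lm(y ~ x)
cooksd <- cooks.distance(fit)
which(cooksd > 4 / length(y))      # observation 50 is flagged
```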

Robust Regression

model_robust <- rlm(y ~ x)
summary(model_robust)
## 
## Call: rlm(formula = y ~ x)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65018 -0.30879 -0.04683  0.27767  0.85091 
## 
## Coefficients:
##             Value  Std. Error t value
## (Intercept) 0.3500 0.0443     7.8950 
## x           0.2818 0.0436     6.4551 
## 
## Residual standard error: 0.4324 on 98 degrees of freedom



THE FINAL EXAM